lattice graphics

lattice graphics

Borrowed from 2012 instance of STAT 540 Statistics for High Dimensional Biology. Hence the use of gene expression data instead of Gapminder. There will be some broken links, I'm sure, but you'll get the main ideas.

Go here for an overview of the R graphics landscape and links to good reference material. This tutorial is an introduction to the add-on package lattice, which is included in all binary distributions of R. If it is not already loaded, load it now:

library(lattice)

Remember you may need to edit the file paths below, to reflect your working directory and local file storage choices.

We will work with a mouse dataset, containing "gene expression profiles of purified photoreceptors at distinct developmental stages and from different genetic backgrounds". The microarray platform was Affymetrix mouse genomic expression array 430 2.0. For more information on what we'll call the photoRec dataset, go here, especially the README.txt.

The full data matrix contains expression data for almost 30K genes/probesets! At first, we will work with a small friendly excerpt. In short, here's the pre-processing we've done for you:

Grab gene expression data for 3 probesets; these are rows in the original data matrix.
Transpose this data, i.e. reshape into 3 columns or variables, and store in a data.frame.
Glue these together with information about genotype and developmental stage of the samples, i.e. add more columns or variables.
Replace the arcane Affymetrix probeset names with arbitrary Pokemon attacks to make it easier to talk about results and figures below. We will just talk about these as if they are genes.

We load this mini dataset from a saved R object, in order to preserve factor levels that were set rationally (and non-alphabetically) during data cleaning and pre-processing. To get the gory details on the how and why of this sort of data handling, go here. FYI, the plain text version is available in the file GSE4051_MINI.txt.

When this class actually ran, my web set-up was different. I've applied a quick fix to Just. Get. The. Data. In.

#I don't want to load from plain text because I'd need to fiddle with factor
#levels. Here's some code that would help you do that, though.
#miniDat <- read.table("GSE4051_MINI.txt",header=TRUE,row.names=1) 
#str(miniDat)

## data import from URL
## kdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/photoRec/GSE4051_MINI.robj"
## load(kdURL)
## OK ... that did not work

## let's try with the dput output
kdURL <- "http://www.stat.ubc.ca/%7Ejenny/notOcto/STAT545A/examples/photoRec/GSE4051_MINI_DPUT.txt" 
kDat <- dget(kdURL)
str(kDat)

## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2..
##  $ gType     : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

table(kDat$devStage)

## 
##     E16      P2      P6     P10 4_weeks 
##       7       8       8       8       8

table(kDat$gType)

## 
##    wt NrlKO 
##    20    19

with(kDat, table(devStage, gType))

##          gType
## devStage  wt NrlKO
##   E16      4     3
##   P2       4     4
##   P6       4     4
##   P10      4     4
##   4_weeks  4     4

We see there are 39 samples (rows or observations). We have

sample number: integers between 1 and 39
developmental stage of the mouse: a factor with levels E16 (day 16 of embryonic development), P2 (postnatal day 2), P6 (postnatal day 6), P10 (postnatal day 10), 4_weeks (4 weeks postnatal)
genotype of the mouse: wt (wild type) or NrlKO (gene Nrl has been knocked out)
gene expression level (in some abstract unit-less sense) for crabHammer, eggBomb, and poisonFang

It's pretty clear the intent was to have 4 mice at each combination of genotype and developmental stage, but there was some mishap for the embryonic knockouts, where we only have data from 3 mice.

Scatterplots

The most important plot type is arguably the scatterplot. The lattice function for this is xyplot(). Let's plot the expression data for two of these genes against each other using the lattice function xyplot().

xyplot(eggBomb ~ crabHammer, kDat)

lattice graphing functions, like many other functions in R, utilize a special formula syntax and that is what is expected as the first inputted argument. The general form is y ~ x, read as "y twiddle x". Here it requests that the first variable, eggBomb, be associated with the y direction and the second, crabHammer, be associated with x. The second argument kDat specifies the data.frame where the variables eggBomb and crabHammer can be found. It is being matched against the formal argument data by its position. As with the formula syntax, the use of a data argument, usually in the second position, is common in R, though sadly not universal. Take advantage of it whenever you can!

You try: request a scatterplot of the variable poisonFang against crabHammer.

Let's imagine that crabHammer is somehow a natural explanatory variable or predictor (weird here, I admit, but go with me) and eggBomb and poisonFang are natural response variables. We might want to see both responses plotted against crabHammer at the same time. Here is a first way to do so, using a bit of a cheat known as the "extended formula interface" in lattice.

xyplot(eggBomb + poisonFang ~ crabHammer, kDat, auto.key = TRUE)

The + sign between the two response variables is, in this context, purely artificial, i.e. we don't literally add them up and scatterplot the sum against crabHammer. Read it as "and". We've added auto.key = TRUE which produces the legend at the top and is a great convenience. Custom legends are absolutely possible but don't go there until you need to!

What if we want each response to have it's own scatterplot, but we want to put them side-by-side for comparison?

xyplot(eggBomb + poisonFang ~ crabHammer, kDat, outer = TRUE, grid = TRUE)

The addition of outer = TRUE has caused the information to be presented in two panels, which is the lattice term for the individual plotting cells. The addition of a background grid, via grid = TRUE, is a great help for viewers who want to make comparisons across panels.

What if we'd like to know which points are from the wild type mice versus the Nrl knockouts?

xyplot(eggBomb + poisonFang ~ crabHammer, kDat, outer = TRUE, grid = TRUE, groups = gType, 
    auto.key = TRUE)

The addition of groups = gType is what gave us the visual cues for genotype and our old friend auto.key = TRUE is back. To create a comparable figure with base R graphics would take much more than one call, consisting of a mere ~100 characters. We could tweet this. This sort of "win" is typical and it's why R users who are serious about their figures generally use lattice and/or ggplot2.

The more proper way to make this last figure requires us to reshape the data. The lattice paradigm is for panels to correspond to a level of a factor or to a unique combination of levels of two or more factors.

Since reshaping is a topic in and of itself, we present the reshaping code without explanation here.

nDat <- with(kDat, data.frame(sample, devStage, gType, crabHammer, probeset = factor(rep(c("eggBomb", 
    "poisonFang"), each = nrow(kDat))), geneExp = c(eggBomb, poisonFang)))
str(nDat)

## 'data.frame':    78 obs. of  6 variables:
##  $ sample    : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2..
##  $ gType     : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ probeset  : Factor w/ 2 levels "eggBomb","poisonFang": 1 1 1 1 1 1 1 1..
##  $ geneExp   : num  7.46 6.89 6.72 6.53 6.47 ...

Our reshaped dataset nDat has 78 observations of 6 variables, whereas kDat had 39 = 78 / 2 observations of 6 variables. The variables sample, devStage, gType, and crabHammer have simply been repeated twice, resulting 39 * 2 = 78 rows. The variables eggBomb and poisonFang have been concatenated into a new variable geneExp with a new factor probeset identifying which gene the data is for.

Now we can make the previous plot with more canonical lattice syntax, i.e. this workflow and way of thinking will serve you better in the future:

xyplot(geneExp ~ crabHammer | probeset, nDat, grid = TRUE, groups = gType, auto.key = TRUE)

We are using another feature of the formula syntax, y ~ x | z, which requests a scatterplot of y versus x for each level of the factor z. In general, in lattice plots the use of the vertical bar | requests individual plots in panels for each value of the factor specified on the right.

You try: Remake this plot but instead of conveying genotype via color, show developmental stage.

Stripplot

The next set of figures we will make requires yet more data reshaping, which is a substantial background task in many analyses. We drop the idea of crabHammer being a predictor and eggBomb and poisonFang being responses and we just treat them all equivalently.

oDat <- with(kDat, data.frame(sample, devStage, gType, probeset = factor(rep(c("crabHammer", 
    "eggBomb", "poisonFang"), each = nrow(kDat))), geneExp = c(crabHammer, eggBomb, 
    poisonFang)))
str(oDat)

## 'data.frame':    117 obs. of  5 variables:
##  $ sample  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
##  $ probeset: Factor w/ 3 levels "crabHammer","eggBomb",..: 1 1 1 1 1 1 1 ..
##  $ geneExp : num  10.22 10.02 9.64 9.65 8.58 ...

Our newly reshaped version of kDat, named oDat, has 39 * 3 = 117 observations but is otherwise very similar to nDat.

A stripplot is a univariate scatterplot. Let's inspect the gene expression data, plain and simple.

stripplot(~geneExp, oDat)

Pretty boring and slightly nonsensical! We had to start somewhere. Let's split things out for the three different genes.

stripplot(probeset ~ geneExp, oDat)

Notice that all the data is presented in one panel but with the different genes corresponding to different locations in the y direction. What if we want to put the different genes in different panels?

stripplot(~geneExp | probeset, oDat, layout = c(nlevels(oDat$probeset), 1))

As with xyplot(), the use of the vertical bar induces the creation of different panels. What if we want to see information about wild type versus Nrl knockout?

stripplot(~geneExp | probeset, oDat, layout = c(nlevels(oDat$probeset), 1), 
    groups = gType, auto.key = TRUE)

As with xyplot(), the groups argument creates visual distinction for a factor -- genotype in this example -- and auto.key = TRUE creates a reasonable default legend.

Let's start exploring gene expression changes over the course of development.

stripplot(geneExp ~ devStage, oDat)

Retaining one panel per gene ....

stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1))

Adding back the genotype information ....

stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1), groups = gType, auto.key = TRUE)

Adding averages

stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1), groups = gType, auto.key = TRUE, grid = TRUE, type = c("p", "a"))

The argument 'type' can be used to add a variety of enhancements. Type is specified as a vector (through the use of 'c'). The option 'p' in the above example specifies the data as points on the plot, 'a' refers to getting the average of each category and joining them by a line (other summaries can be requested too). Some of the other options include 'l' for joining points by lines, 'b' for both points and lines, 'r' for adding the fit from a simple linear regression and 'smooth' for adding a nonparametric "smooth" fitted curve.

Densityplot

Here's a nice alternative to histograms!

densityplot(~geneExp, oDat)

The vertical bar works as usual.

densityplot(~geneExp | gType, oDat, grid = TRUE)

groups works as usual -- a real advantage over histogram.

densityplot(~geneExp, oDat, groups = gType, auto.key = TRUE)

The argument 'bw' specifies the bandwidth or the spread of the underlying Gaussian distributions. It controls how smooth this smoothed histogram will be. Though densityplot() has a sensible default, you can always specify directly if you wish. The argument 'n' controls the number of points at which the kernel density estimate is evaluated. It is easy to confuse this with the usual use of 'n' to denote sample size, so beware. If your density looks jaggedy, try increasing 'n'.

jBw <- 0.2
jn <- 400
densityplot(~geneExp, oDat, groups = gType, auto.key = TRUE, bw = jBw, n = jn, 
    main = paste("bw =", jBw, ", n =", jn))

You try: use densityplot() to explore the gene expression distribution by gene and/or developmental stage. Play with 'bw' and 'n' if you like.

There is also a time and place for boxplots, obtained with the lattice function bwplot() for "box-and-whiskers plot".

bwplot(geneExp ~ devStage, oDat)

The vertical bar | still works ....

bwplot(geneExp ~ devStage | gType, oDat)

A violinplot is a hybrid of densityplot and histogram.

bwplot(geneExp ~ devStage, oDat, panel = panel.violin)

This is our first explicit glimpse of how a panel function might be specified to some non-default value. More on that can be found in Sarkar's book and the STAT545A materials.

Heatmaps

Now we need a larger dataset. Let's load the real full data matrix and experimental design now.

The experimental data, i.e. covariate information for the samples, is available in plain text form in the file GSE4051_design.txt but you will notice below that we prefer to load it from a saved R object, which preserves factor levels. To get the gory details on the how and why of this sort of data cleaning, go here.

When this class actually ran, my web set-up was different. I've applied a quick fix to Just. Get. The. Data. In.

dataURL <- "http://www.stat.ubc.ca/%7Ejenny/notOcto/STAT545A/examples/photoRec/GSE4051_data.txt"
prDat <- read.table(dataURL)
str(prDat)

## 'data.frame':    29949 obs. of  39 variables:
##  $ Sample_20: num  7.24 9.48 10.01 8.36 8.59 ...
##  $ Sample_21: num  7.41 10.02 10.04 8.37 8.62 ...
##  $ Sample_22: num  7.17 9.85 9.91 8.4 8.52 ...
##  $ Sample_23: num  7.07 10.13 9.91 8.49 8.64 ...
##  $ Sample_16: num  7.38 7.64 8.42 8.36 8.51 ...
##  $ Sample_17: num  7.34 10.03 10.24 8.37 8.89 ...
##  $ Sample_6 : num  7.24 9.71 10.17 8.84 8.54 ...
##  $ Sample_24: num  7.11 9.75 9.39 8.37 8.36 ...
##  $ Sample_25: num  7.19 9.16 10.11 8.2 8.5 ...
##  $ Sample_26: num  7.18 9.49 9.41 8.73 8.39 ...
##  $ Sample_27: num  7.21 8.64 9.43 8.33 8.43 ...
##  $ Sample_14: num  7.09 9.56 9.88 8.57 8.59 ...
##  $ Sample_3 : num  7.16 9.55 9.84 8.33 8.5 ...
##  $ Sample_5 : num  7.08 9.32 9.24 8.3 8.48 ...
##  $ Sample_8 : num  7.11 8.24 9.13 8.13 8.33 ...
##  $ Sample_28: num  7.34 8.27 9.47 8.38 8.4 ...
##  $ Sample_29: num  7.66 10.03 9.88 8.56 8.69 ...
##  $ Sample_30: num  7.26 9.27 10.54 8.15 8.55 ...
##  $ Sample_31: num  7.31 9.26 10.1 8.37 8.49 ...
##  $ Sample_1 : num  7.15 9.87 9.68 8.28 8.5 ...
##  $ Sample_10: num  7.28 10.29 9.91 8.42 8.68 ...
##  $ Sample_4 : num  7.18 10.16 9.72 8.32 8.5 ...
##  $ Sample_7 : num  7.15 8.95 9.3 8.17 8.41 ...
##  $ Sample_32: num  7.54 9.53 9.92 8.78 8.57 ...
##  $ Sample_33: num  7.01 8.97 9.22 8.42 8.53 ...
##  $ Sample_34: num  6.81 8.83 9.39 8.1 8.32 ...
##  $ Sample_35: num  7.15 9.22 10.06 8.35 8.45 ...
##  $ Sample_13: num  7.33 9.33 9.75 8.43 8.48 ...
##  $ Sample_15: num  7.12 9.15 9.84 8.32 8.21 ...
##  $ Sample_18: num  7.21 9.49 10.03 8.55 8.5 ...
##  $ Sample_19: num  7.21 9.21 9.59 8.31 8.31 ...
##  $ Sample_36: num  7.25 9.66 9.51 8.49 8.42 ...
##  $ Sample_37: num  7.04 8.38 9.21 8.75 8.26 ...
##  $ Sample_38: num  7.37 9.44 9.48 8.49 8.34 ...
##  $ Sample_39: num  7.13 8.73 9.53 8.65 8.28 ...
##  $ Sample_11: num  7.42 9.83 10 8.6 8.43 ...
##  $ Sample_12: num  7.11 9.71 9.43 8.43 8.5 ...
##  $ Sample_2 : num  7.35 9.66 9.91 8.4 8.37 ...
##  $ Sample_9 : num  7.32 9.8 9.85 8.4 8.46 ...

designURL <- "http://www.stat.ubc.ca/%7Ejenny/notOcto/STAT545A/examples/photoRec/GSE4051_design_DPUT.txt"
prDes <- dget(designURL)
str(prDes)

## 'data.frame':    39 obs. of  3 variables:
##  $ sample  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...

Let's draw 50 probesets at random -- but in a repeatable way! Read the random seed module to remind yourself how and why we do this.

set.seed(2128)
(yo <- sample(1:nrow(prDat), size = 50))

##  [1] 16901 20791  2108 10647 26484  5620  9261  2184 14226 23904  5580
## [12] 17665  1510 15780  2113  6288 10356 14989  5578  5283 15457 18321
## [23] 11043  6510 13921  3230  6240 28508 12741 14590  5840 25206 22600
## [34] 26976   867 22068 22546 23815  1312  7477   221 18547 21479 10363
## [45] 25165 20794 12349 27395 12872 20432

hDat <- prDat[yo, ]
str(hDat)

## 'data.frame':    50 obs. of  39 variables:
##  $ Sample_20: num  6.68 7.96 6.26 8.7 8.88 ...
##  $ Sample_21: num  6.33 7.54 6.4 8.78 9.57 ...
##  $ Sample_22: num  6.61 7.58 6.25 8.61 9.26 ...
##  $ Sample_23: num  6.34 7.6 6.49 8.83 9.54 ...
##  $ Sample_16: num  6.96 7.84 6.45 9.04 8.78 ...
##  $ Sample_17: num  6.58 7.49 6.2 9.04 8.98 ...
##  $ Sample_6 : num  6.44 7.83 6.43 9.05 8.94 ...
##  $ Sample_24: num  7.08 7.75 6.34 8.78 8.55 ...
##  $ Sample_25: num  6.54 7.49 6.4 8.75 8.67 ...
##  $ Sample_26: num  6.91 7.84 6.43 8.85 8.3 ...
##  $ Sample_27: num  6.64 7.77 6.44 8.85 8.67 ...
##  $ Sample_14: num  6.85 7.55 6.57 8.97 9.07 ...
##  $ Sample_3 : num  6.7 7.65 6.59 8.85 9.1 ...
##  $ Sample_5 : num  6.89 7.61 6.56 8.85 9.24 ...
##  $ Sample_8 : num  7 7.92 6.75 8.94 8.72 ...
##  $ Sample_28: num  6.92 7.68 6.23 8.95 8.57 ...
##  $ Sample_29: num  6.66 7.65 5.96 8.57 8.5 ...
##  $ Sample_30: num  6.95 7.48 6.68 8.47 8.16 ...
##  $ Sample_31: num  6.69 7.65 6.45 8.67 8.54 ...
##  $ Sample_1 : num  6.84 7.63 6.68 8.62 8.93 ...
##  $ Sample_10: num  6.68 7.61 6.69 8.53 9.23 ...
##  $ Sample_4 : num  6.81 7.52 6.72 8.52 9.53 ...
##  $ Sample_7 : num  6.96 7.45 6.49 8.63 8.43 ...
##  $ Sample_32: num  6.91 7.57 6.51 8.54 8.54 ...
##  $ Sample_33: num  6.97 7.55 6.52 8.52 8.17 ...
##  $ Sample_34: num  7.15 7.52 6.63 8.58 7.95 ...
##  $ Sample_35: num  6.77 7.54 6.64 8.69 8.63 ...
##  $ Sample_13: num  6.76 7.71 6.15 8.41 8.93 ...
##  $ Sample_15: num  6.83 7.88 6.44 8.49 8.54 ...
##  $ Sample_18: num  6.55 7.65 6.33 8.48 8.86 ...
##  $ Sample_19: num  6.79 7.9 6.43 8.76 8.38 ...
##  $ Sample_36: num  6.8 7.4 6.33 8.33 8.89 ...
##  $ Sample_37: num  6.88 7.79 6.51 8.58 8.73 ...
##  $ Sample_38: num  7.01 7.62 6.35 8.48 8.58 ...
##  $ Sample_39: num  6.77 7.58 6.22 8.47 8.53 ...
##  $ Sample_11: num  6.64 7.65 6.33 8.38 9.55 ...
##  $ Sample_12: num  6.74 7.81 6.68 8.69 9.4 ...
##  $ Sample_2 : num  6.62 7.63 6.26 8.44 9.31 ...
##  $ Sample_9 : num  6.55 7.69 6.18 8.85 9.15 ...

The functions for heatmapping expect a matrix not a data.frame, so we will convert hDat and also transpose for a nicer heatmap orientation below. I also give the samples more informative names that capture genotype and developmental stage.

hDat <- as.matrix(t(hDat))
rownames(hDat) <- with(prDes, paste(devStage, gType, "s", sample))
str(hDat)

##  num [1:39, 1:50] 6.68 6.33 6.61 6.34 6.96 ...
##   ..$ : chr [1:50] "1439675_at" "1446739_at" "1418248_at" "1430530_s_at" ..
##   ..$ : chr [1:39] "E16 wt s 20" "E16 wt s 21" "E16 wt s 22" "E16 wt s 23" ...
##   ..$ : chr [1:50] "1439675_at" "1446739_at" "1418248_at" "1430530_s_at" ..

The basic built-in function for heatmapping is heatmap(). Warning: these Oscar Mayer / McDonald's colors are awful.

heatmap(hDat, Rowv = NA, Colv = NA, scale = "none", margins = c(5, 8))

Some of the other built-in color schemes aren't quite as likely to make your eyes bleed ...

heatmap(hDat, Rowv = NA, Colv = NA, col = cm.colors(256), scale = "none", margins = c(5, 
    8))

But long-term it is good to learn how to take control of the color scheme. To help with that, we load the package RColorBrewer that has some useful pre-selected palettes I often use as the basis for a variety of color tasks.

If the package is not already installed, install it

install.packages("RColorBrewer")

Then load it and take a look at the palettes it offers.

library(RColorBrewer)
display.brewer.all()

Warning: this will seem arcane. To map a small number of colors into a ramp or palette suitable for encoding a quantitative variable as colors, the function colorRampPalette() is useful. Its somewhat surprising return value is a function itself that takes a number n as input and returns a correspondingly sized color ramp. Here I prepare to use a subdued gray palette and also a more stimulating pale blue to purple palette.

jGraysFun <- colorRampPalette(brewer.pal(n = 9, "Greys"))
jBuPuFun <- colorRampPalette(brewer.pal(n = 9, "BuPu"))

Let's revisit the first heatmap in grays....

heatmap(hDat, Rowv = NA, Colv = NA, scale = "none", margins = c(5, 8), col = jGraysFun(256))

and the blue-purple palette.

heatmap(hDat, Rowv = NA, Colv = NA, scale = "none", margins = c(5, 8), col = jBuPuFun(256))

By specifying Rowv = NA, Colv = NA, scale = "none", we have been suppressing some rather common heatmap features -- the inclusion of row and column dendrograms and the normalization of the data. Let's look at the heatmap as it would be rendered by default.

heatmap(hDat, margins = c(5, 8), col = jBuPuFun(256))

Now we allow scaling within column:

heatmap(hDat, col = jBuPuFun(256), margins = c(5, 8), scale = c("column"))

Finally we try out another popular heatmapping function heatmap.2() from the gplots package. This adds an automatic color legend, which helps you determine what each color extreme actually means. If you need to install the package do this:

install.packages("gplots")

library(gplots)
heatmap.2(hDat, col = jGraysFun, trace = "none")

heatmap.2(hDat, col = jBuPuFun, trace = "none")

High-volume Scatterplots

Now that we've loaded the main dataset we can also explore high-volume scatterplotting. First let's pick two samples at random to plot against each other.

set.seed(924)
(yo <- sample(1:ncol(prDat), size = 2))

## [1] 25 24

y <- prDat[[yo[1]]]
z <- prDat[[yo[2]]]
str(y)

##  num [1:29949] 7.01 8.97 9.22 8.42 8.53 ...

str(z)

##  num [1:29949] 7.54 9.53 9.92 8.78 8.57 ...

xyplot(y ~ z, asp = 1)

See the giant dark point cloud of death? What's going on in there? Who knows. The overplotting is potentially hiding information about the joint distribution. Also, these plots use up lots of ink when printing and take a long time to load if you're giving a talk. Stop using plain vanilla scatterplot tools for very high-volume plots. The superior alternatives divide the plane into small neighborhoods and shade them according to how many points fall there.

The base package graphics ('base' meaning always installed, always loaded) offers the smoothScatter() function which shades the plane according to a 2D kernel density estimate.

smoothScatter(y ~ z, asp = 1)

You can see that we were missing some information in the dark cloud above. There is one main clump of data, concentrated around (6, 6) and then petering out diagonally up the x = y line. There's arguably a second, smaller clump of data on a steeper line running through the points ~(10, 8) and ~(14, 14).

The xyplot() function in lattice can produce a similar plot by specifying a smoothScatter-type of panel function.

xyplot(y ~ z, asp = 1, panel = panel.smoothScatter, nbin = 150)

The add-on package hexbin implements hexagonal binning. Basically the plane is divided into hexagons and shaded as described above. Install it if you need to.

install.packages("hexbin")

library(hexbin)
hexbinplot(y ~ z)

Finally, if you want to plot several variables against each other, i.e. create a scatterplot matrix, there are several functions for that.

base graphics offers pairs()
lattice offers splom() for scatterplot matrix
hexbin offers hexplom()
ggplot2 offers plotmatrix() (sorry we won't go here today)

We need to take a slightly larger sample of columns now.

set.seed(624)
(yo <- sample(1:ncol(prDat), size = 4))

## [1] 37 11  1 28

pairDat <- subset(prDat, select = yo)
str(pairDat)

## 'data.frame':    29949 obs. of  4 variables:
##  $ Sample_12: num  7.11 9.71 9.43 8.43 8.5 ...
##  $ Sample_27: num  7.21 8.64 9.43 8.33 8.43 ...
##  $ Sample_20: num  7.24 9.48 10.01 8.36 8.59 ...
##  $ Sample_13: num  7.33 9.33 9.75 8.43 8.48 ...

Using the base function pairs() ... You will notice this is a bit slow and we get the usual awful dark point clouds.

pairs(pairDat)

However, pairs() can be combined with smoothScatter() for a better result. Somewhat faster and definitely better looking, more informative.

pairs(pairDat, panel = function(...) smoothScatter(..., add = TRUE))

Here's splom() from lattice, first using the default, non-high-volume panel function.

splom(pairDat)

Here's splom() from lattice again, but using a smoothScatter-type panel function. Much faster! More informative!

splom(pairDat, panel = panel.smoothScatter, raster = TRUE)

Finally, here's hexplom().

hexplom(pairDat)

Take-home problem

The full photoRec dataset has 39 samples and 29,949 probesets. Choose 2 ... or 20 ... or 200 random probesets/genes and look for gene expression differences between the two genotypes, wild type versus knockout. Make use of the graphing techniques discussed this week such as scatter plots, data heatmaps, correlation heatmaps, etc. Share questions, success, failure on the Google group.