Backfilling some technical info on lattice

Backfilling some technical info on `lattice`

Now that you have made many lattice plots, let's fill in some technical details that will help you become even more effective in your use of lattice.

Optional getting started advice

Ignore if you don't need this bit of support.

This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd() command or by glancing at the top of RStudio's Console pane.

Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "technical background" and "lattice".

Load the Gapminder data, drop Oceania, load packages

Assuming the data can be found in the current working directory, this works:

gDat <- read.delim("gapminderDataFiveYear.txt")

Plan B (I use here, because of where the source of this tutorial lives):

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)

Basic sanity check that the import has gone well:

str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Drop Oceania, which only has two continents

## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)

## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Load the lattice and plyr packages.

library(lattice)
library(plyr)

High level `lattice` functions actually return objects

myAwesomePlot <-
  xyplot(lifeExp ~ gdpPercap | continent, jDat,
         scales = list(x = list(log = 10, equispaced.log = FALSE)),
         type = c("p", "smooth"), grid = TRUE, col.line = "darkorange", lwd = 4)

Hello? Where is the plot? Our previous plots have been appearing in a graphics device on your screen (and inserted into this document) because, by default, the print() method for objects of class trellis has been dispatched on the return value of our high-level lattice commands. If you make an assignment, as we did above, or if the command is wrapped inside a function, inside a plyr call, or inside a loop, automatic printing will not occur. To explicitly print the figure use print() or plot().

print(myAwesomePlot)

It is possible to do other things to a trellis object, such as update() or summary() but I rarely do.

Putting more than one `lattice` plot on a "page"

Another reason to assign the object and call print() explicitly is to exploit additional arguments (read the help on print.trellis() to see them all). The ones you will use most facilitate the placement of more than 1 figure on a page; the arguments arepos = and split =. Here we use print(..., pos =...) to arrange two plots together.

myOtherPlot <- stripplot(lifeExp ~ reorder(continent, lifeExp),
                         subset(jDat, subset = year %in% c(1952, 1977, 2007)),
                         groups = year, auto.key = list(reverse.rows = TRUE),
                         jitter.data = TRUE, type = c("p", "a"), fun = median)

## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion

print(myAwesomePlot, pos = c(0, 0, 0.52, 1), more = TRUE)
print(myOtherPlot, pos = c(0.48, 0, 1, 1))

The split = argument is also useful for this purpose but I prefer pos = because I can specify a wee bit of overlap to reduce whitespace between the plots. In the calls above, note that the lefthand plot runs from the far left to just past the halfway mark and the right hand plot begins just before halfway and ends at the far right.

Extended formula interace of `lattice`

You are probably beginning to realize that data in a tall or long form is easier to plot than data in a short or wide form. Sadly, exactly the opposite is true for creating human-facing tables. So a skilled data analyst is adept at reshaping. That topic is covered more here (link to future reshaping topic). Feeling lazy? lattice does offer one trick that may let you get away with short/wide data: the extended formula interface.

Consider one of our data aggregation tasks: looking at different measures of spread for life expectancy over time, across different continents.

lifeExpSpread <- ddply(jDat, ~ continent + year, summarize,
                       sdGdpPercap = sd(gdpPercap), iqrGdpPercap = IQR(gdpPercap),
                       madGdpPercap = mad(gdpPercap))
head(lifeExpSpread)

##   continent year sdGdpPercap iqrGdpPercap madGdpPercap
## 1    Africa 1952         983        919.9        696.9
## 2    Africa 1957        1135        966.1        712.1
## 3    Africa 1962        1462       1028.9        786.3
## 4    Africa 1967        2848       1110.3        743.0
## 5    Africa 1972        3287       1476.3        952.1
## 6    Africa 1977        4142       2035.7       1015.2

In our aggregated result lifeExpSpread, each measure of spread exists as its own variable. Let's say we want to plot those over time, distinguishing the different measures via color and superposition or by multi-panel conditioning, possibly depicting continent as well. To use the native lattice functionality for superposition or conditioning, we would need to reshape the data. The variables sdGdpPercap, iqrGdpPercap, and madGdpPercap would need to be catenated and a parallel factor created to indicate sd vs iqr vs mad. If you have no other need for such reshaped data, you can bypass its creation by using the extended formula interface.

## cheap trick? using lattice's extended formula interface to avoiding reshaping
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, lifeExpSpread,
       subset = continent == "Africa", type = "b", ylab = "measure of spread",
       auto.key = list(x = 0.07, y = 0.85, corner = c(0, 1)))
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, lifeExpSpread,
       group = reorder(continent, sdGdpPercap), layout = c(3, 1),
       type = "b", ylab = "measure of spread",
       auto.key = list(x = 0.35, y = 0.85, corner = c(0, 1),
                       reverse.rows = TRUE))

## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion

In the first figure, we focus on one continent. But we achieve an effect similar to the use of the group = argument by specifying multiple terms as primary variables, separated by plus + signs. The reshaping described above is carried out temporarily by lattice; in particular, a virtual factor is created to distinguish sd vs iqr vs mad. In the second figure, we explicitly request group = continent and so the different measures of spread are parcelled out to different panels, as if we had properly requested multi-panel conditioning on the virtual factor.

If you want the virtual factor to be treated as a conditioning factor, the explicit way to request is to set outer = TRUE. Above this is achieved as a side effect of specifying continent as the grouping variable. Here's a more useful, typical example of how this works:

xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, lifeExpSpread,
       subset = continent == "Africa", type = "b", ylab = "measure of spread",
       outer = TRUE, layout = c(3, 1), aspect = 1)

Read more in section 10.2 "The extended formula interface" in Sarkar's lattice book.

High level `lattice` functions have panel functions at their heart

Functions like stripplot() and xyplot() are examples of high level lattice functions. Table 1.1 in Sarkar's lattice book lists 15 high level functions.

Table 1.1 of Sarkar

We've pointed out many of the useful features in common to these functions: the data = and subset = arguments for specifying input data, the formula interface (which supports multipanel conditioning), and the group = and type = arguments. All high level functions are wrappers around lots of smaller functions that do the grunt work like set up layouts and coordinate systems, draw axes, and add graphical elements. The high level functions can be organized into groups and, within these groups, there is a lot of shared infrastructure with respect to the inner workings. It can simplify your life to understand this, so you can transfer knowledge and tricks from one function to another. Here are the functions you will use most, organized into groups:

densityplot(), histogram()
xyplot(), stripplot(), bwplot(), dotplot(), barchart()
levelplot(), contourplot()
splom(), parallelplot() (was parallel())

Of all the lower level functions wrapped up inside a high level function, the most important one is the panel function, which does the actual plotting. Every lattice plot consists of at least one panel; in the case of multi-panel conditioning via e.g., y ~ x | z, there will be one panel per level of z. The input data is chopped into pieces called packets, one per panel, very much like the splitting done in plyr::ddply(). The panel function then plots the data from one packet into its associated panel.

Each high level function has a default panel function and, by convention, its name is panel. followed by the name of the high level function. When simple customizations aren't enough, the first thing to do is to read the help on the default panel function. Seriously, right now go read the help on panel.xyplot()! This is where you will gain deeper knowledge of the arguments available for customization. Even though these are arguments of the panel function itself, you can often set them in a high level function call, and they will be passed through due to the use of the special ... argument (read more about that here. In fact, you've already seen examples like this:

xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       type = c("p", "smooth"), col.line = "darkorange", lwd = 3)

Call up the help file on xyplot() and search it for arguments col.line and lwd. You will not find them (OK, lwd is mentioned in the documentation about keys but that doesn't count). But you will find them in the help for panel.xyplot(). Learn this lesson well. It's how you can take your lattice plots to the next level without too much pain.

Customizing the panel function

Sometimes the customization you need will require you to modify the panel function. Relax! It's really not that hard and is very rewarding. We will walk through a typical process for writing a custom panel function 'on the fly'.

I present a modified excerpt from my development process for creating a single year "still snapshot" of a typical Gapminder moving bubble chart. I load a souped-up version of the Gapminder data you know and love that [1] includes some stunning country colors and; [2] is sorted by country size so little countries aren't hidden by big ones. Furthermore, we work with a subset of the data for just two years.

gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderWithColorsAndSorted.txt"
kDat <- read.delim(file = gdURL, as.is = 7) # protect color
str(kDat)

## 'data.frame':    1704 obs. of  7 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 25 59 135 67 60 48 1..
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 2 3 3 4 2 4 ..
##  $ year     : int  1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
##  $ pop      : num  5.56e+08 3.72e+08 1.58e+08 8.65e+07 8.21e+07 ...
##  $ lifeExp  : num  44 37.4 68.4 63 37.5 ...
##  $ gdpPercap: num  400 547 13990 3217 750 ...
##  $ color    : chr  "#40004B" "#460552" "#A50026" "#611A6D" ...

jYear <- c(1952, 2007)
yDat <- subset(kDat, year %in% jYear)
str(yDat)

## 'data.frame':    284 obs. of  7 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 25 59 135 67 60 48 1..
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 2 3 3 4 2 4 ..
##  $ year     : int  1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
##  $ pop      : num  5.56e+08 3.72e+08 1.58e+08 8.65e+07 8.21e+07 ...
##  $ lifeExp  : num  44 37.4 68.4 63 37.5 ...
##  $ gdpPercap: num  400 547 13990 3217 750 ...
##  $ color    : chr  "#40004B" "#460552" "#A50026" "#611A6D" ...

First start with a high level function call that works, but is missing some of the cool features you want. Crawl before you walk. Walk before you run.

xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)))

My first step is to specify the panel function explicitly but, of course, to specify the default. Do you think pilots switch off auto-pilot in the middle of a delicate maneuver? No, they prefer to take manual control when the plane is flying smoothly at 30,000 feet in clear skies. You will see no change in the figure, which is exactly what we want at this point.

xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
       panel = panel.xyplot)

Now I write a custom panel function 'on the fly' but the function body just consists of the defaul panel function. Any arguments normally passed to the panel function are passed straight through to panel.xyplot(). I omit the plot because there is no visible change, which is exactly what I want right now.

xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
       panel = function(...) {
         panel.xyplot(...)
         })

Now I modify the custom panel function to have arguments x, y, and .... If you read the help on panel.xyplot(), you will notice that x and y are the first two formal arguments and they do not have default values. This is why, if I want to specify any of the other arguments, I must explicitly specify x and y. Below, I take whatever would have normally been given to the panel function as x and y and am passing them straight through. I omit the plot because there is no visible change, which is exactly what I want right now.

xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         })

Now I take control of the plotting symbol: I make it an empty circle, which happens to be symbol 21. Conceptually, I scale the radius of the circle so that the area will reflect the country's population. The logic and testing of this mapping of population into circle size was done elsewhere and required some thought and care, FYI. The mechanics are a bit tricky: The term cex (think "character expansion) is conventionally used in R as an argument controlling font or symbol size. That's the dial I need to turn to change circle size. I set the cex = argument of the high level xyplot() call to a vector corresponding to the rows of the input data yDat. Since xyplot() doesn't have a cex = argument, it passes that down to the panel function, which is panel.xyplot() in this case, which does have a cex = argument. Then, inside the panel function, I consult the subscripts argument to learn which rows of the original input data are in the current packet. I can then index cex by these subscripts to get the corect radii and draw the circles. Phew.

jCexDivisor <- 1500                     # arbitrary scaling constant
jPch <- 21
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
       cex = sqrt(yDat$pop/pi)/jCexDivisor, 
       panel = function(x, y, ..., cex, subscripts) {
         panel.xyplot(x, y, cex = cex[subscripts], pch = jPch, ...)
         })

Now I fill the circles with the pre-chosen country colors and explicitly set the circle outlines to a nice tasteful grey. The fill color mechanics mirror those used for setting the circle size.

jDarkGray <- 'grey20'
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
       cex = sqrt(yDat$pop/pi)/jCexDivisor, fill.color = yDat$color,
       col = jDarkGray,
       panel = function(x, y, ..., cex, fill.color, subscripts) {
         panel.xyplot(x, y, cex = cex[subscripts],
                      pch = jPch, fill = fill.color[subscripts], ...)
         })

Since I've come this far, I will go ahead and add a key to link the continents to colors (or color families, here). This has nothing to do with panel functions!

gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderContinentColors.txt"
continentColors <- read.delim(file = gdURL, as.is = 3) # protect color
continentKey <-
  with(continentColors,
       list(x = 0.95, y = 0.05, corner = c(1, 0),
            text = list(as.character(continent)),
            points = list(pch = jPch, col = jDarkGray, fill = color)))
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
       cex = sqrt(yDat$pop/pi)/jCexDivisor, fill.color = yDat$color,
       col = jDarkGray, key = continentKey,
       panel = function(x, y, ..., cex, fill.color, subscripts) {
         panel.xyplot(x, y, cex = cex[subscripts],
                      pch = jPch, fill = fill.color[subscripts], ...)
         })

To conclude, we present our starting and ending plots side by side.

xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10)))
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
       grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
       cex = sqrt(yDat$pop/pi)/jCexDivisor, fill.color = yDat$color,
       col = jDarkGray, key = continentKey,
       panel = function(x, y, ..., cex, fill.color, subscripts) {
         panel.xyplot(x, y, cex = cex[subscripts],
                      pch = jPch, fill = fill.color[subscripts], ...)
         })

The differences in the code are really not that huge, considering how different the output is.

Student Q & A

Q: What if you want to plot 2 quantitative variables against another BUT they are on completely different scales? We got kind of lucky above with the different measures of spread -- there was considerable overlap in their observed values and therefore the extended formula interface and a single y-axis "just worked".

A: How to do this is demonstrated in a few places in Sarkar's lattice book. This requires these manipulations:

rescaling at least one variable, to get it on the same scale as the other MANDATORY
adding an axis for the rescaled variable STRONGLY ENCOURAGED
distributing type = across the multiple primary variables via distribute.type = TRUE DO IF NEEDED

See section 5.3.1 (p. 79 - 82) and section 8.4.2 (p. 148 - 150).

References

Lattice: Multivariate Data Visualization with R available via SpringerLink by Deepayan Sarkar, Springer (2008) | all code from the book | GoogleBooks search

My step-by-step development of a custom panel function was heavily inspired by "Lattice Tricks for the Power UseR", a talk by Deepayan Sarkar at the 2007 UseR! conference.