lattice
lattice
functions actually return objectslattice
plot on a "page"lattice
lattice
functions have panel functions at their heartlattice
Now that you have made many lattice
plots, let's fill in some technical details that will help you become even more effective in your use of lattice
.
Ignore if you don't need this bit of support.
This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd()
command or by glancing at the top of RStudio's Console pane.
Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "technical background" and "lattice".
Assuming the data can be found in the current working directory, this works:
gDat <- read.delim("gapminderDataFiveYear.txt")
Plan B (I use here, because of where the source of this tutorial lives):
## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
Basic sanity check that the import has gone well:
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Drop Oceania, which only has two continents
## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)
## 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Load the lattice
and plyr
packages.
library(lattice)
library(plyr)
lattice
functions actually return objectsmyAwesomePlot <-
xyplot(lifeExp ~ gdpPercap | continent, jDat,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "smooth"), grid = TRUE, col.line = "darkorange", lwd = 4)
Hello? Where is the plot? Our previous plots have been appearing in a graphics device on your screen (and inserted into this document) because, by default, the print()
method for objects of class trellis
has been dispatched on the return value of our high-level lattice
commands. If you make an assignment, as we did above, or if the command is wrapped inside a function, inside a plyr
call, or inside a loop, automatic printing will not occur. To explicitly print the figure use print()
or plot()
.
print(myAwesomePlot)
It is possible to do other things to a trellis
object, such as update()
or summary()
but I rarely do.
lattice
plot on a "page"Another reason to assign the object and call print()
explicitly is to exploit additional arguments (read the help on print.trellis()
to see them all). The ones you will use most facilitate the placement of more than 1 figure on a page; the arguments arepos =
and split =
. Here we use print(..., pos =...)
to arrange two plots together.
myOtherPlot <- stripplot(lifeExp ~ reorder(continent, lifeExp),
subset(jDat, subset = year %in% c(1952, 1977, 2007)),
groups = year, auto.key = list(reverse.rows = TRUE),
jitter.data = TRUE, type = c("p", "a"), fun = median)
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
print(myAwesomePlot, pos = c(0, 0, 0.52, 1), more = TRUE)
print(myOtherPlot, pos = c(0.48, 0, 1, 1))
The split =
argument is also useful for this purpose but I prefer pos =
because I can specify a wee bit of overlap to reduce whitespace between the plots. In the calls above, note that the lefthand plot runs from the far left to just past the halfway mark and the right hand plot begins just before halfway and ends at the far right.
lattice
You are probably beginning to realize that data in a tall or long form is easier to plot than data in a short or wide form. Sadly, exactly the opposite is true for creating human-facing tables. So a skilled data analyst is adept at reshaping. That topic is covered more here (link to future reshaping topic). Feeling lazy? lattice
does offer one trick that may let you get away with short/wide data: the extended formula interface.
Consider one of our data aggregation tasks: looking at different measures of spread for life expectancy over time, across different continents.
lifeExpSpread <- ddply(jDat, ~ continent + year, summarize,
sdGdpPercap = sd(gdpPercap), iqrGdpPercap = IQR(gdpPercap),
madGdpPercap = mad(gdpPercap))
head(lifeExpSpread)
## continent year sdGdpPercap iqrGdpPercap madGdpPercap
## 1 Africa 1952 983 919.9 696.9
## 2 Africa 1957 1135 966.1 712.1
## 3 Africa 1962 1462 1028.9 786.3
## 4 Africa 1967 2848 1110.3 743.0
## 5 Africa 1972 3287 1476.3 952.1
## 6 Africa 1977 4142 2035.7 1015.2
In our aggregated result lifeExpSpread
, each measure of spread exists as its own variable. Let's say we want to plot those over time, distinguishing the different measures via color and superposition or by multi-panel conditioning, possibly depicting continent as well. To use the native lattice
functionality for superposition or conditioning, we would need to reshape the data. The variables sdGdpPercap
, iqrGdpPercap
, and madGdpPercap
would need to be catenated and a parallel factor created to indicate sd
vs iqr
vs mad
. If you have no other need for such reshaped data, you can bypass its creation by using the extended formula interface.
## cheap trick? using lattice's extended formula interface to avoiding reshaping
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, lifeExpSpread,
subset = continent == "Africa", type = "b", ylab = "measure of spread",
auto.key = list(x = 0.07, y = 0.85, corner = c(0, 1)))
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, lifeExpSpread,
group = reorder(continent, sdGdpPercap), layout = c(3, 1),
type = "b", ylab = "measure of spread",
auto.key = list(x = 0.35, y = 0.85, corner = c(0, 1),
reverse.rows = TRUE))
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion Warning: NAs introduced by coercion
In the first figure, we focus on one continent. But we achieve an effect similar to the use of the group =
argument by specifying multiple terms as primary variables, separated by plus +
signs. The reshaping described above is carried out temporarily by lattice
; in particular, a virtual factor is created to distinguish sd
vs iqr
vs mad
. In the second figure, we explicitly request group = continent
and so the different measures of spread are parcelled out to different panels, as if we had properly requested multi-panel conditioning on the virtual factor.
If you want the virtual factor to be treated as a conditioning factor, the explicit way to request is to set outer = TRUE
. Above this is achieved as a side effect of specifying continent
as the grouping variable. Here's a more useful, typical example of how this works:
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, lifeExpSpread,
subset = continent == "Africa", type = "b", ylab = "measure of spread",
outer = TRUE, layout = c(3, 1), aspect = 1)
Read more in section 10.2 "The extended formula interface" in Sarkar's lattice
book.
lattice
functions have panel functions at their heartFunctions like stripplot()
and xyplot()
are examples of high level lattice
functions. Table 1.1 in Sarkar's lattice
book lists 15 high level functions.
Table 1.1 of Sarkar
We've pointed out many of the useful features in common to these functions: the data =
and subset =
arguments for specifying input data, the formula interface (which supports multipanel conditioning), and the group =
and type =
arguments. All high level functions are wrappers around lots of smaller functions that do the grunt work like set up layouts and coordinate systems, draw axes, and add graphical elements. The high level functions can be organized into groups and, within these groups, there is a lot of shared infrastructure with respect to the inner workings. It can simplify your life to understand this, so you can transfer knowledge and tricks from one function to another. Here are the functions you will use most, organized into groups:
densityplot()
, histogram()
xyplot()
, stripplot()
, bwplot()
, dotplot()
, barchart()
levelplot()
, contourplot()
splom()
, parallelplot()
(was parallel()
)Of all the lower level functions wrapped up inside a high level function, the most important one is the panel function, which does the actual plotting. Every lattice
plot consists of at least one panel; in the case of multi-panel conditioning via e.g., y ~ x | z
, there will be one panel per level of z
. The input data is chopped into pieces called packets, one per panel, very much like the splitting done in plyr::ddply()
. The panel function then plots the data from one packet into its associated panel.
Each high level function has a default panel function and, by convention, its name is panel.
followed by the name of the high level function. When simple customizations aren't enough, the first thing to do is to read the help on the default panel function. Seriously, right now go read the help on panel.xyplot()
! This is where you will gain deeper knowledge of the arguments available for customization. Even though these are arguments of the panel function itself, you can often set them in a high level function call, and they will be passed through due to the use of the special ...
argument (read more about that here. In fact, you've already seen examples like this:
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "smooth"), col.line = "darkorange", lwd = 3)
Call up the help file on xyplot()
and search it for arguments col.line
and lwd
. You will not find them (OK, lwd
is mentioned in the documentation about keys but that doesn't count). But you will find them in the help for panel.xyplot()
. Learn this lesson well. It's how you can take your lattice
plots to the next level without too much pain.
Sometimes the customization you need will require you to modify the panel function. Relax! It's really not that hard and is very rewarding. We will walk through a typical process for writing a custom panel function 'on the fly'.
I present a modified excerpt from my development process for creating a single year "still snapshot" of a typical Gapminder moving bubble chart. I load a souped-up version of the Gapminder data you know and love that [1] includes some stunning country colors and; [2] is sorted by country size so little countries aren't hidden by big ones. Furthermore, we work with a subset of the data for just two years.
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderWithColorsAndSorted.txt"
kDat <- read.delim(file = gdURL, as.is = 7) # protect color
str(kDat)
## 'data.frame': 1704 obs. of 7 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 25 59 135 67 60 48 1..
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 2 3 3 4 2 4 ..
## $ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
## $ pop : num 5.56e+08 3.72e+08 1.58e+08 8.65e+07 8.21e+07 ...
## $ lifeExp : num 44 37.4 68.4 63 37.5 ...
## $ gdpPercap: num 400 547 13990 3217 750 ...
## $ color : chr "#40004B" "#460552" "#A50026" "#611A6D" ...
jYear <- c(1952, 2007)
yDat <- subset(kDat, year %in% jYear)
str(yDat)
## 'data.frame': 284 obs. of 7 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 25 59 135 67 60 48 1..
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 2 3 3 4 2 4 ..
## $ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
## $ pop : num 5.56e+08 3.72e+08 1.58e+08 8.65e+07 8.21e+07 ...
## $ lifeExp : num 44 37.4 68.4 63 37.5 ...
## $ gdpPercap: num 400 547 13990 3217 750 ...
## $ color : chr "#40004B" "#460552" "#A50026" "#611A6D" ...
First start with a high level function call that works, but is missing some of the cool features you want. Crawl before you walk. Walk before you run.
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)))
My first step is to specify the panel function explicitly but, of course, to specify the default. Do you think pilots switch off auto-pilot in the middle of a delicate maneuver? No, they prefer to take manual control when the plane is flying smoothly at 30,000 feet in clear skies. You will see no change in the figure, which is exactly what we want at this point.
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
panel = panel.xyplot)
Now I write a custom panel function 'on the fly' but the function body just consists of the defaul panel function. Any arguments normally passed to the panel function are passed straight through to panel.xyplot()
. I omit the plot because there is no visible change, which is exactly what I want right now.
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
panel = function(...) {
panel.xyplot(...)
})
Now I modify the custom panel function to have arguments x
, y
, and ...
. If you read the help on panel.xyplot()
, you will notice that x
and y
are the first two formal arguments and they do not have default values. This is why, if I want to specify any of the other arguments, I must explicitly specify x
and y
. Below, I take whatever would have normally been given to the panel function as x
and y
and am passing them straight through. I omit the plot because there is no visible change, which is exactly what I want right now.
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
})
Now I take control of the plotting symbol: I make it an empty circle, which happens to be symbol 21. Conceptually, I scale the radius of the circle so that the area will reflect the country's population. The logic and testing of this mapping of population into circle size was done elsewhere and required some thought and care, FYI. The mechanics are a bit tricky: The term cex
(think "character expansion) is conventionally used in R as an argument controlling font or symbol size. That's the dial I need to turn to change circle size. I set the cex =
argument of the high level xyplot()
call to a vector corresponding to the rows of the input data yDat
. Since xyplot()
doesn't have a cex =
argument, it passes that down to the panel function, which is panel.xyplot()
in this case, which does have a cex =
argument. Then, inside the panel function, I consult the subscripts
argument to learn which rows of the original input data are in the current packet. I can then index cex
by these subscripts to get the corect radii and draw the circles. Phew.
jCexDivisor <- 1500 # arbitrary scaling constant
jPch <- 21
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
cex = sqrt(yDat$pop/pi)/jCexDivisor,
panel = function(x, y, ..., cex, subscripts) {
panel.xyplot(x, y, cex = cex[subscripts], pch = jPch, ...)
})
Now I fill the circles with the pre-chosen country colors and explicitly set the circle outlines to a nice tasteful grey. The fill color mechanics mirror those used for setting the circle size.
jDarkGray <- 'grey20'
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
cex = sqrt(yDat$pop/pi)/jCexDivisor, fill.color = yDat$color,
col = jDarkGray,
panel = function(x, y, ..., cex, fill.color, subscripts) {
panel.xyplot(x, y, cex = cex[subscripts],
pch = jPch, fill = fill.color[subscripts], ...)
})
Since I've come this far, I will go ahead and add a key to link the continents to colors (or color families, here). This has nothing to do with panel functions!
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderContinentColors.txt"
continentColors <- read.delim(file = gdURL, as.is = 3) # protect color
continentKey <-
with(continentColors,
list(x = 0.95, y = 0.05, corner = c(1, 0),
text = list(as.character(continent)),
points = list(pch = jPch, col = jDarkGray, fill = color)))
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
cex = sqrt(yDat$pop/pi)/jCexDivisor, fill.color = yDat$color,
col = jDarkGray, key = continentKey,
panel = function(x, y, ..., cex, fill.color, subscripts) {
panel.xyplot(x, y, cex = cex[subscripts],
pch = jPch, fill = fill.color[subscripts], ...)
})
To conclude, we present our starting and ending plots side by side.
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10)))
xyplot(lifeExp ~ gdpPercap | factor(year), yDat, aspect = 2/3,
grid = TRUE, scales = list(x = list(log = 10, equispaced.log = FALSE)),
cex = sqrt(yDat$pop/pi)/jCexDivisor, fill.color = yDat$color,
col = jDarkGray, key = continentKey,
panel = function(x, y, ..., cex, fill.color, subscripts) {
panel.xyplot(x, y, cex = cex[subscripts],
pch = jPch, fill = fill.color[subscripts], ...)
})
The differences in the code are really not that huge, considering how different the output is.
Q: What if you want to plot 2 quantitative variables against another BUT they are on completely different scales? We got kind of lucky above with the different measures of spread -- there was considerable overlap in their observed values and therefore the extended formula interface and a single y-axis "just worked".
A: How to do this is demonstrated in a few places in Sarkar's lattice
book. This requires these manipulations:
type =
across the multiple primary variables via distribute.type = TRUE
DO IF NEEDEDSee section 5.3.1 (p. 79 - 82) and section 8.4.2 (p. 148 - 150).
Lattice: Multivariate Data Visualization with R available via SpringerLink by Deepayan Sarkar, Springer (2008) | all code from the book | GoogleBooks search
My step-by-step development of a custom panel function was heavily inspired by "Lattice Tricks for the Power UseR", a talk by Deepayan Sarkar at the 2007 UseR! conference.