Scatterplots with lattice

Scatterplots with `lattice`

We focus on studying the relationship between two quantitative variables -- possibly in conjunction with one or more categorical variables. We use the xyplot() function from the lattice package in this tutorial and will revisit this ground using ggplot2 shortly.

Optional getting started advice

Ignore if you don't need this bit of support.

This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd() command or by glancing at the top of RStudio's Console pane.

Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "scatter plots" and "lattice".

Load the Gapminder data, drop Oceania, load packages

Assuming the data can be found in the current working directory, this works:

gDat <- read.delim("gapminderDataFiveYear.txt")

Plan B (I use here, because of where the source of this tutorial lives):

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)

Basic sanity check that the import has gone well:

str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Drop Oceania, which only has two countries

## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)

## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Load the lattice package.

library(lattice)

Show me the data with `xyplot()`

Get a scatterplot with xyplot(). Background grids are nice. In addition to grid = TRUE, grid = "h" and grid = "v" are other useful shortcuts.

xyplot(lifeExp ~ gdpPercap, jDat)
xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE)

Clearly we need to log transform the horizontal axis, i.e. the one for GDP per capita. You can take the logarithm directly in the formula, but it's sub-optimal due to the axis tick labels, which are not so easy to read. It's better to request the log transform via the scales = argument.

## log, the sub-optimal way
xyplot(lifeExp ~ log10(gdpPercap), jDat,
       grid = TRUE)
## logging, better way ... step 1
xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10)))

There are many other log axis labelling strategies. The next easiest thing is to specify equispaced.log = FALSE Even more is possible with functions in the latticeExtra package; see some examples in these slides. I wish the tick marks and grid lined up, but I'm trying not to let the perfect be the enemy of the good. Let's press on.

xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)))

`type =`

The type = argument can be used to enhance the figure with data-responsive elements. On the left, we specify the default value type = "p", which requests only points, and on the right, we request a simple linear regression with type = c("p", "r").

xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       type = "p")
xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       type = c("p", "r"))

Let's make line that more obnoxious and more orange. And let's try a different overlay -- namely, a smooth regression via type = c("p", "smooth"). In this case the difference is pretty subtle.

xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       type = c("p", "r"), col.line = "darkorange", lwd = 3)
xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       type = c("p", "smooth"), col.line = "darkorange", lwd = 3)

`group =`

If you specify a grouping variable via group =, the data from the various groups will be superposed and visually distinguished. A simple key is obtained with auto.key = TRUE, but there are many alternative ways to customize the key. We won't fuss with that on this quick tour.

xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       group = continent)

## auto.key
xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       group = continent, auto.key = TRUE)

The group = argument can be used together with the type = argument.

## groups + type "smooth"
xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       group = continent, auto.key = TRUE,
       type = c("p", "smooth"), lwd = 4)
## making key more compact
xyplot(lifeExp ~ gdpPercap, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       group = continent, auto.key = list(columns = nlevels(jDat$continent)),
       type = c("p", "smooth"), lwd = 4)

Multi-panel conditioning

If you would rather see data for different groups in separate panels, instead of superposed, specify the grouping factor z like so: y ~ x | z. This is called multi-panel conditioning. Sometimes I specify the same factor as the conditioning and grouping variable, just because I like to keep the color scheme constant and visible.

xyplot(lifeExp ~ gdpPercap | continent, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)))

xyplot(lifeExp ~ gdpPercap | continent, jDat,
       group = continent,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)))

You can use conditioning, grouping, and type = in various combinations.

## conditioning + type "r" or "smooth"
xyplot(lifeExp ~ gdpPercap | continent, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       type = c("p", "smooth"), col.line = "darkorange", lwd = 4)
xyplot(lifeExp ~ gdpPercap | continent, jDat,
       grid = TRUE, group = continent,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       type = c("p", "smooth"), lwd = 4)

Combat overplotting

In a univariate setting, we fought overplotting with jitter, before we turned to entirely different strategies, such as the kernel density estimation that underpins densityplot(). In scatterplots, the entry level solution to overplotting is to use alpha transparency for the points. I do that on the left via alpha = 1/2. Here's how to think of this: if alpha transparency is set to \(1/k\) then an area becomes opaque when \(k\) points overlap. (My lattice default specifies pch = 16, if you want to specify that in your call. Warning: not all graphics devices support alpha transparency.)

xyplot(lifeExp ~ gdpPercap | continent, jDat,
       grid = TRUE, 
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       type = c("p", "smooth"), lwd = 4, alpha = 1/2)
xyplot(lifeExp ~ gdpPercap | continent, jDat,
       grid = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)),
       panel = panel.smoothScatter)

On the right I actually request a non-default panel function, panel.smoothScatter() that is, in fact, based on 2D kernel density estimation. There's more about panel functions here.

Finally, hexagonal binning is an interesting way to handle overplotting. The plane is divided into a hexagonal lattice and the individual hexagons are shaded according to the number of point that fall within. This requires an add-on library hexbin which you will need to install. Then you can use hexbinplo() as almost a drop-in substitute for xyplot().

#install.packages("hexbin", dependencies = TRUE)
library(hexbin)
hexbinplot(lifeExp ~ gdpPercap, jDat,
           scales = list(x = list(log = 10, equispaced.log = FALSE)),
           aspect = 1, bins=50)

And thus endeth the quick tour of xyplot().

References

Lattice: Multivariate Data Visualization with R available via SpringerLink by Deepayan Sarkar, Springer (2008) | all code from the book | GoogleBooks search

Ch. 5 Scatter Plots and Extensions is most relevant to this tutorial