Exploring a quantitative variable with lattice

We are going to focus on studying a single quantitative variable -- either alone or in conjunction with one or more categorical variables. We are going to use functions from the lattice package in this tutorial and will revisit this ground using ggplot2 shortly.

Optional getting started advice

Ignore if you don't need this bit of support.

This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd() command or by glancing at the top of RStudio's Console pane.

Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "univariate plots" and "lattice".

Load the Gapminder data and lattice (and plyr)

Assuming the data can be found in the current working directory, this works:

gDat <- read.delim("gapminderDataFiveYear.txt")

Plan B (I use here, because of where the source of this tutorial lives):

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)

Basic sanity check that the import has gone well:

str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Load the lattice and plyr packages.

library(lattice)
library(plyr)

Show me the data with stripplot()

Every time someone shows me a table of estimated regression coefficients or drops a naked p-value into their paper ... without ever showing me the raw data I want to scream: Show me the data! Convince me you have taken the time to look at it yourself before you plunged into testing and modelling! The "show me the money" scene from the movie Jerry Maguire should really be edited to replace "money" with "data".

A stripplot is a univariate scatterplot.

stripplot(~lifeExp, gDat)

It's silly to apply to such a large dataset: overplotting causes the data to look like a line segment. Let's look at less data.

hDat <- subset(gDat, year %in% c(1952, 2007))  # storing because we will re-use alot
stripplot(~lifeExp, hDat)

Still silly. But look what happens once we ask for stripplots for each continent.

stripplot(continent ~ lifeExp, hDat)

I would prefer to have the y-axis correspond to life expectancy and the x-axis to continent. So I just swap the variables' positions in the formula.

stripplot(lifeExp ~ continent, hDat)

Despite our focus on less data, we still have overplotting. It can help to add jitter, a small bit of meaningless noise, in the horizontal position, which corresponds to continent.

stripplot(lifeExp ~ continent, hDat, jitter.data = TRUE)

A horizontal grid would be nice here. (Notice how the data wiggles every time we plot due to the random nature of the jitter.)

stripplot(lifeExp ~ continent, hDat, jitter.data = TRUE, grid = "h")

I'm interested in the typical value for each continent, numerically and graphically. This is your first glimpse at how nicely all of these habits and tools play together: religious use of data.frames, keeping data "tidy", plyr and lattice and/or ggplot2.

# ddply(gDat, ~continent, function(x) summary(x$lifeExp))
ddply(hDat, ~continent, summarize, meanLifeExp = mean(lifeExp))
##   continent meanLifeExp
## 1    Africa       46.97
## 2  Americas       63.44
## 3      Asia       58.52
## 4    Europe       71.03
## 5   Oceania       74.99
stripplot(lifeExp ~ continent, hDat, jitter.data = TRUE, grid = "h", type = c("p", 
    "a"))

Eyeball-o-metrically verify that the reported averages match what you see on the plot. This sort of paranoid consistency check is critical in early data exploration. The a in the type = argument requests connect-the-dots for the continent specific averages. Under the hood, there is a fun = argument that merely defaults to mean(), thereby connecting the averages. If we want something else, we can specify fun = explicitly. Maybe I'm interested in the lowest life expectancies and want fun = min. And those low-outliers intrigue me ... which countries are they?

ddply(hDat, ~continent, function(x) {
    theMin <- which.min(x$lifeExp)
    x[theMin, c("country", "year", "continent", "lifeExp")]
})
##       country year continent lifeExp
## 1      Gambia 1952    Africa   30.00
## 2       Haiti 1952  Americas   37.58
## 3 Afghanistan 1952      Asia   28.80
## 4      Turkey 1952    Europe   43.59
## 5   Australia 1952   Oceania   69.12
stripplot(lifeExp ~ continent, hDat, jitter.data = TRUE, grid = "h", type = c("p", 
    "a"), fun = min)

I'm tired of dragging Oceania around, with its two measly countries. Let's remove all associated observations. To truly drop "Oceania" as a factor level, I use the droplevels() function (for more on being the boss of factors read this tutorial).

iDat <- subset(hDat, continent != "Oceania")
table(iDat$continent)  # Oceania is still there!
## 
##   Africa Americas     Asia   Europe  Oceania 
##      104       50       66       60        0
iDat <- droplevels(subset(hDat, continent != "Oceania"))
table(iDat$continent)  # Oceania is gone now
## 
##   Africa Americas     Asia   Europe 
##      104       50       66       60
stripplot(lifeExp ~ continent, iDat, jitter.data = TRUE, grid = "h", type = c("p", 
    "a"))

I'd like the continents to be in order of life expectancy, low to high. And yes I realize that's an ill-defined thing. You will have to make lots of judgement calls to get anywhere in data analysis. Embrace being The Decider. (For more on reordering factors read this tutorial).

ddply(iDat, ~continent, summarize, avgLifeExp = mean(lifeExp))
##   continent avgLifeExp
## 1    Africa      46.97
## 2  Americas      63.44
## 3      Asia      58.52
## 4    Europe      71.03
## Americas and Asia should be switched in the order of the levels
levels(reorder(iDat$continent, iDat$lifeExp))
## [1] "Africa"   "Asia"     "Americas" "Europe"
iDat <- within(iDat, continent <- reorder(continent, lifeExp))
ddply(iDat, ~continent, summarize, avgLifeExp = mean(lifeExp))  # YES!
##   continent avgLifeExp
## 1    Africa      46.97
## 2      Asia      58.52
## 3  Americas      63.44
## 4    Europe      71.03
stripplot(lifeExp ~ continent, iDat, jitter.data = TRUE, grid = "h", type = c("p", 
    "a"))

Much better! Now let's distinguish the two years of data with the very powerful groups = argument, which accepts a factor.

stripplot(lifeExp ~ continent, iDat, groups = year, jitter.data = TRUE, type = c("p", 
    "a"), fun = median)

But how do the colors correspond to year? We need a key. auto.key = TRUE is often good enough for quick plots.

stripplot(lifeExp ~ continent, iDat, groups = year, auto.key = TRUE, jitter.data = TRUE, 
    type = c("p", "a"), fun = median)

But I hate it when the key order has no relation to the data order or, especially, when they are exactly reversed, as they are here. The auto.key = argument can be logical (TRUE or FALSE) but it can be also be used for limited customization.

stripplot(lifeExp ~ continent, iDat, groups = year, auto.key = list(reverse.rows = TRUE), 
    jitter.data = TRUE, type = c("p", "a"), fun = median)

Direct labelling would be even better but requires more packages. Look into directlabels and latticedl, I think, for ggplot2 and lattice respectively.

Show me the distribution with densityplot()

Sometimes you have too much data to work with stripplot(). Or maybe you just want a better sense of where the data is truly concentrated. The next visualization you should try is densityplot().

densityplot(~lifeExp, iDat)

You can think of this as a smooth histogram. Under the hood, it is actually a kernel density estimate. By default, the data is portrayed down below as a "rug", but you may want to turn this off and request a simple reference line instead.

densityplot(~lifeExp, iDat, plot.points = FALSE, ref = TRUE)

One advantage over histograms is that it is easier to portray multiple distributions on one plot, so you can make comparisons. The group = argument can be used here as well.

densityplot(~lifeExp, iDat, plot.points = FALSE, ref = TRUE, group = continent, 
    auto.key = list(columns = nlevels(iDat$continent)))

You can also portray different sub-datasets in different panels, known as multi-panel conditioning. Specified on the far right of the formula following a vertical bar, as in y ~ x | z, the conditioning variable z must be a factor (or something that can be made into one) and you will get the requested type of plot of y against x for each level of z. Multi-panel conditioning is available in many high-level lattice functions.

densityplot(~lifeExp | factor(year), iDat)
t.test(lifeExp ~ year, iDat)
## 
##  Welch Two Sample t-test
## 
## data:  lifeExp by year
## t = -12.52, df = 278, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -20.88 -15.21
## sample estimates:
## mean in group 1952 mean in group 2007 
##              48.77              66.81

That's interesting! These distributions look very different than the overall distribution, underscoring the importance of incorporating important covariates into both your visualizations and your models.

Some arguments you may wish to tune on your densityplots are bw =, adjust =, and n =, which specify the bandwidth of the underlying kernel smoother and how fine a grid is used when the kernel density estimate is evaluated. To make the underlying kernel density estimate smoother, increase the bandwidth. You can specify it directly with bw = but it's usually easier to use adjust =. Under the hood, densityplot() gets the kernel density estimate from the density() function, which uses good algorithms to pick a default bandwidth. With adjust =, you can tweak the bandwidth in multiples of the default. It is really easy to confuse yourself with the n = argument: believe it or not, n = has nothing to do with sample size or the number of observations in your dataset. The kernel density estimate is evaluated at a grid of points spanning the observed data and then drawn in a connect-the-dots way. It will look better if the grid is fine, i.e. if there are lots of points. The argument n = controls the number of these points, so increase this if you see hard corners. Let's demonstrate.

hardCorners <- densityplot(~lifeExp, iDat, n = 20, main = "n = 20")
softCorners <- densityplot(~lifeExp, iDat, n = 200, main = "n = 200")
print(hardCorners, position = c(0, 0, 0.55, 1), more = TRUE)
print(softCorners, position = c(0.45, 0, 1, 1))
wiggly <- densityplot(~lifeExp, iDat, adjust = 0.5, main = "default bw * 0.5")
rolling <- densityplot(~lifeExp, iDat, adjust = 2, main = "default bw * 2")
print(wiggly, position = c(0, 0, 0.55, 1), more = TRUE)
print(rolling, position = c(0.45, 0, 1, 1))

histograms, box plots, violin plots, ecdf plots

histogram(~lifeExp, gDat)
histogram(~lifeExp, gDat, nint = 50)
bwplot(lifeExp ~ continent, iDat)
bwplot(lifeExp ~ continent, iDat, panel = panel.violin)
bwplot(lifeExp ~ as.factor(year) | continent, subset(gDat, continent != "Oceania"))
bwplot(lifeExp ~ reorder(continent, lifeExp), subset(gDat, continent != "Oceania"), 
    panel = function(..., box.ratio) {
        panel.violin(..., col = "transparent", border = "grey60", varwidth = FALSE, 
            box.ratio = box.ratio)
        panel.bwplot(..., fill = NULL, box.ratio = 0.1)
    })

Q and A

Student: the connect-the-dots portrayal of continent specific averages on the stripplot bothers me, since continent isn't a truly quantitative variable.

Here's the sort of plot we're talking about:

stripplot(lifeExp ~ continent, iDat, jitter.data = TRUE, grid = "h", type = c("p", 
    "a"))

(Sort of) an answer: when type = includes a, panel.average() is being invoked behind the scenes. Here's the description from documentation: "panel.average treats one of x and y as a factor (according to the value of horizontal), calculates fun applied to the subsets of the other variable determined by each unique value of the factor, and joins them by a line. Can be used in conjunction with panel.xyplot, and more commonly with panel.superpose to produce interaction plots." If you look at the source code for panel.average() you can see that the connect-the-dots nature of it is completely hard-wired. So to portray the values returned by fun differently, such as a very large point or a thick horizontal bar, would require some custom work. Too bad. You will get more control of this sort of thing in ggplot2.

References

Lattice: Multivariate Data Visualization with R available via SpringerLink by Deepayan Sarkar, Springer (2008) | all code from the book | GoogleBooks search