lattice
We are going to focus on studying a single quantitative variable -- either alone or in conjunction with one or more categorical variables. We are going to use functions from the lattice
package in this tutorial and will revisit this ground using ggplot2
shortly.
Ignore if you don't need this bit of support.
This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd()
command or by glancing at the top of RStudio's Console pane.
Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "univariate plots" and "lattice".
lattice
(and plyr
)Assuming the data can be found in the current working directory, this works:
gDat <- read.delim("gapminderDataFiveYear.txt")
Plan B (I use here, because of where the source of this tutorial lives):
## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
Basic sanity check that the import has gone well:
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Load the lattice
and plyr
packages.
library(lattice)
library(plyr)
stripplot()
Every time someone shows me a table of estimated regression coefficients or drops a naked p-value into their paper ... without ever showing me the raw data I want to scream: Show me the data! Convince me you have taken the time to look at it yourself before you plunged into testing and modelling! The "show me the money" scene from the movie Jerry Maguire should really be edited to replace "money" with "data".
A stripplot is a univariate scatterplot.
stripplot(~lifeExp, gDat)
It's silly to apply to such a large dataset: overplotting causes the data to look like a line segment. Let's look at less data.
hDat <- subset(gDat, year %in% c(1952, 2007)) # storing because we will re-use alot
stripplot(~lifeExp, hDat)
Still silly. But look what happens once we ask for stripplots for each continent.
stripplot(continent ~ lifeExp, hDat)
I would prefer to have the y-axis correspond to life expectancy and the x-axis to continent. So I just swap the variables' positions in the formula.
stripplot(lifeExp ~ continent, hDat)
Despite our focus on less data, we still have overplotting. It can help to add jitter, a small bit of meaningless noise, in the horizontal position, which corresponds to continent.
stripplot(lifeExp ~ continent, hDat, jitter.data = TRUE)
A horizontal grid would be nice here. (Notice how the data wiggles every time we plot due to the random nature of the jitter.)
stripplot(lifeExp ~ continent, hDat, jitter.data = TRUE, grid = "h")
I'm interested in the typical value for each continent, numerically and graphically. This is your first glimpse at how nicely all of these habits and tools play together: religious use of data.frames, keeping data "tidy", plyr
and lattice
and/or ggplot2
.
# ddply(gDat, ~continent, function(x) summary(x$lifeExp))
ddply(hDat, ~continent, summarize, meanLifeExp = mean(lifeExp))
## continent meanLifeExp
## 1 Africa 46.97
## 2 Americas 63.44
## 3 Asia 58.52
## 4 Europe 71.03
## 5 Oceania 74.99
stripplot(lifeExp ~ continent, hDat, jitter.data = TRUE, grid = "h", type = c("p",
"a"))
Eyeball-o-metrically verify that the reported averages match what you see on the plot. This sort of paranoid consistency check is critical in early data exploration. The a
in the type =
argument requests connect-the-dots for the continent specific averages. Under the hood, there is a fun =
argument that merely defaults to mean()
, thereby connecting the averages. If we want something else, we can specify fun =
explicitly. Maybe I'm interested in the lowest life expectancies and want fun = min
. And those low-outliers intrigue me ... which countries are they?
ddply(hDat, ~continent, function(x) {
theMin <- which.min(x$lifeExp)
x[theMin, c("country", "year", "continent", "lifeExp")]
})
## country year continent lifeExp
## 1 Gambia 1952 Africa 30.00
## 2 Haiti 1952 Americas 37.58
## 3 Afghanistan 1952 Asia 28.80
## 4 Turkey 1952 Europe 43.59
## 5 Australia 1952 Oceania 69.12
stripplot(lifeExp ~ continent, hDat, jitter.data = TRUE, grid = "h", type = c("p",
"a"), fun = min)
I'm tired of dragging Oceania around, with its two measly countries. Let's remove all associated observations. To truly drop "Oceania" as a factor level, I use the droplevels()
function (for more on being the boss of factors read this tutorial).
iDat <- subset(hDat, continent != "Oceania")
table(iDat$continent) # Oceania is still there!
##
## Africa Americas Asia Europe Oceania
## 104 50 66 60 0
iDat <- droplevels(subset(hDat, continent != "Oceania"))
table(iDat$continent) # Oceania is gone now
##
## Africa Americas Asia Europe
## 104 50 66 60
stripplot(lifeExp ~ continent, iDat, jitter.data = TRUE, grid = "h", type = c("p",
"a"))
I'd like the continents to be in order of life expectancy, low to high. And yes I realize that's an ill-defined thing. You will have to make lots of judgement calls to get anywhere in data analysis. Embrace being The Decider. (For more on reordering factors read this tutorial).
ddply(iDat, ~continent, summarize, avgLifeExp = mean(lifeExp))
## continent avgLifeExp
## 1 Africa 46.97
## 2 Americas 63.44
## 3 Asia 58.52
## 4 Europe 71.03
## Americas and Asia should be switched in the order of the levels
levels(reorder(iDat$continent, iDat$lifeExp))
## [1] "Africa" "Asia" "Americas" "Europe"
iDat <- within(iDat, continent <- reorder(continent, lifeExp))
ddply(iDat, ~continent, summarize, avgLifeExp = mean(lifeExp)) # YES!
## continent avgLifeExp
## 1 Africa 46.97
## 2 Asia 58.52
## 3 Americas 63.44
## 4 Europe 71.03
stripplot(lifeExp ~ continent, iDat, jitter.data = TRUE, grid = "h", type = c("p",
"a"))
Much better! Now let's distinguish the two years of data with the very powerful groups =
argument, which accepts a factor.
stripplot(lifeExp ~ continent, iDat, groups = year, jitter.data = TRUE, type = c("p",
"a"), fun = median)
But how do the colors correspond to year? We need a key. auto.key = TRUE
is often good enough for quick plots.
stripplot(lifeExp ~ continent, iDat, groups = year, auto.key = TRUE, jitter.data = TRUE,
type = c("p", "a"), fun = median)
But I hate it when the key order has no relation to the data order or, especially, when they are exactly reversed, as they are here. The auto.key =
argument can be logical (TRUE or FALSE) but it can be also be used for limited customization.
stripplot(lifeExp ~ continent, iDat, groups = year, auto.key = list(reverse.rows = TRUE),
jitter.data = TRUE, type = c("p", "a"), fun = median)
Direct labelling would be even better but requires more packages. Look into
directlabels
andlatticedl
, I think, forggplot2
andlattice
respectively.
densityplot()
Sometimes you have too much data to work with stripplot()
. Or maybe you just want a better sense of where the data is truly concentrated. The next visualization you should try is densityplot()
.
densityplot(~lifeExp, iDat)
You can think of this as a smooth histogram. Under the hood, it is actually a kernel density estimate. By default, the data is portrayed down below as a "rug", but you may want to turn this off and request a simple reference line instead.
densityplot(~lifeExp, iDat, plot.points = FALSE, ref = TRUE)
One advantage over histograms is that it is easier to portray multiple distributions on one plot, so you can make comparisons. The group =
argument can be used here as well.
densityplot(~lifeExp, iDat, plot.points = FALSE, ref = TRUE, group = continent,
auto.key = list(columns = nlevels(iDat$continent)))
You can also portray different sub-datasets in different panels, known as multi-panel conditioning. Specified on the far right of the formula following a vertical bar, as in y ~ x | z
, the conditioning variable z
must be a factor (or something that can be made into one) and you will get the requested type of plot of y
against x
for each level of z
. Multi-panel conditioning is available in many high-level lattice
functions.
densityplot(~lifeExp | factor(year), iDat)
t.test(lifeExp ~ year, iDat)
##
## Welch Two Sample t-test
##
## data: lifeExp by year
## t = -12.52, df = 278, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -20.88 -15.21
## sample estimates:
## mean in group 1952 mean in group 2007
## 48.77 66.81
That's interesting! These distributions look very different than the overall distribution, underscoring the importance of incorporating important covariates into both your visualizations and your models.
Some arguments you may wish to tune on your densityplots are bw =
, adjust =
, and n =
, which specify the bandwidth of the underlying kernel smoother and how fine a grid is used when the kernel density estimate is evaluated. To make the underlying kernel density estimate smoother, increase the bandwidth. You can specify it directly with bw =
but it's usually easier to use adjust =
. Under the hood, densityplot()
gets the kernel density estimate from the density()
function, which uses good algorithms to pick a default bandwidth. With adjust =
, you can tweak the bandwidth in multiples of the default. It is really easy to confuse yourself with the n =
argument: believe it or not, n =
has nothing to do with sample size or the number of observations in your dataset. The kernel density estimate is evaluated at a grid of points spanning the observed data and then drawn in a connect-the-dots way. It will look better if the grid is fine, i.e. if there are lots of points. The argument n =
controls the number of these points, so increase this if you see hard corners. Let's demonstrate.
hardCorners <- densityplot(~lifeExp, iDat, n = 20, main = "n = 20")
softCorners <- densityplot(~lifeExp, iDat, n = 200, main = "n = 200")
print(hardCorners, position = c(0, 0, 0.55, 1), more = TRUE)
print(softCorners, position = c(0.45, 0, 1, 1))
wiggly <- densityplot(~lifeExp, iDat, adjust = 0.5, main = "default bw * 0.5")
rolling <- densityplot(~lifeExp, iDat, adjust = 2, main = "default bw * 2")
print(wiggly, position = c(0, 0, 0.55, 1), more = TRUE)
print(rolling, position = c(0.45, 0, 1, 1))
histogram(~lifeExp, gDat)
histogram(~lifeExp, gDat, nint = 50)
bwplot(lifeExp ~ continent, iDat)
bwplot(lifeExp ~ continent, iDat, panel = panel.violin)
bwplot(lifeExp ~ as.factor(year) | continent, subset(gDat, continent != "Oceania"))
bwplot(lifeExp ~ reorder(continent, lifeExp), subset(gDat, continent != "Oceania"),
panel = function(..., box.ratio) {
panel.violin(..., col = "transparent", border = "grey60", varwidth = FALSE,
box.ratio = box.ratio)
panel.bwplot(..., fill = NULL, box.ratio = 0.1)
})
Student: the connect-the-dots portrayal of continent specific averages on the stripplot bothers me, since continent isn't a truly quantitative variable.
Here's the sort of plot we're talking about:
stripplot(lifeExp ~ continent, iDat, jitter.data = TRUE, grid = "h", type = c("p",
"a"))
(Sort of) an answer: when type =
includes a
, panel.average()
is being invoked behind the scenes. Here's the description from documentation: "panel.average
treats one of x and y as a factor (according to the value of horizontal), calculates fun
applied to the subsets of the other variable determined by each unique value of the factor, and joins them by a line. Can be used in conjunction with panel.xyplot
, and more commonly with panel.superpose
to produce interaction plots." If you look at the source code for panel.average()
you can see that the connect-the-dots nature of it is completely hard-wired. So to portray the values returned by fun
differently, such as a very large point or a thick horizontal bar, would require some custom work. Too bad. You will get more control of this sort of thing in ggplot2
.
Lattice: Multivariate Data Visualization with R available via SpringerLink by Deepayan Sarkar, Springer (2008) | all code from the book | GoogleBooks search