(Older material on) Data aggregation

The remnants of a previous tutorial that prioritized built-in apply type functiond for data aggregation, versus using plyr. No promises that this document will read very smoothly or be complete!

Load the Gapminder data

Assuming the data can be found in the current working directory, this works:

gDat <- read.delim("gapminderDataFiveYear.txt")

Plan B (I use here, because of where the source of this tutorial lives):

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)

Basic sanity check that the import has gone well:

str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Data aggregation

If you feel the urge to store a little snippet of a data.frame:

(snippet <- subset(gDat, country == "Canada"))
##     country year      pop continent lifeExp gdpPercap
## 241  Canada 1952 14785584  Americas   68.75     11367
## 242  Canada 1957 17010154  Americas   69.96     12490
## 243  Canada 1962 18985849  Americas   71.30     13462
## 244  Canada 1967 20819767  Americas   72.13     16077
## 245  Canada 1972 22284500  Americas   72.88     18971
## 246  Canada 1977 23796400  Americas   74.21     22091
## 247  Canada 1982 25201900  Americas   75.76     22899
## 248  Canada 1987 26549700  Americas   76.86     26627
## 249  Canada 1992 28523502  Americas   77.95     26343
## 250  Canada 1997 30305843  Americas   78.61     28955
## 251  Canada 2002 31902268  Americas   79.77     33329
## 252  Canada 2007 33390141  Americas   80.65     36319

Stop and ask yourself ... >Do I want to create sub-data.frames for each level of some factor (or unique combination of several factors) ... in order to compute or graph something?

If NO, then maybe you really do need to store a copy of a subset of the data.frame. But seriously consider whether you can achieve your goals by simple using the subset = argument to limit the rows a function will act on. If this still does not suit your needs, then maybe you should use subset() as shown above and carry on.

If YES, use data aggregation techniques or conditioning in lattice or facetting in ggplot2 plots -- don’t subset the data.frame. Or, to be totally clear, only subset the data.frame as a temporary measure as you develop your elegant code for computing on or visualizing these sub-data.frames.

For those situations when you need to so something for various chunks of your dataset, the best method depends on the nature of these chunks

chunks are ... relevant functions

rows, columns, etc. of matrix/array

apply

components of a list (includes variables in a data.frame!)

sapply, lapply

induced by levels of one or more factor(s)

tapply, by, split (+ [sl]apply, aggregate

Example of computing summaries for rows and columns of a matrix

(jCountries <- sort(c("Canada", "United States", "Mexico")))
## [1] "Canada"        "Mexico"        "United States"
tinyDat <- subset(gDat, country %in% jCountries)
str(tinyDat)  # 'data.frame': 36 obs. of 6 variables:
## 'data.frame':    36 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 21 21 21 21 21 21 21..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  14785584 17010154 18985849 20819767 22284500 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 ..
##  $ lifeExp  : num  68.8 70 71.3 72.1 72.9 ...
##  $ gdpPercap: num  11367 12490 13462 16077 18971 ...
(nY <- length(unique(tinyDat$year)))  # 12 years
## [1] 12
jLifeExp <- matrix(tinyDat$lifeExp, nrow = nY)
colnames(jLifeExp) <- jCountries
rownames(jLifeExp) <- tinyDat$year[1:nY]
jLifeExp
##      Canada Mexico United States
## 1952  68.75  50.79         68.44
## 1957  69.96  55.19         69.49
## 1962  71.30  58.30         70.21
## 1967  72.13  60.11         70.76
## 1972  72.88  62.36         71.34
## 1977  74.21  65.03         73.38
## 1982  75.76  67.41         74.65
## 1987  76.86  69.50         75.02
## 1992  77.95  71.45         76.09
## 1997  78.61  73.67         76.81
## 2002  79.77  74.90         77.31
## 2007  80.65  76.19         78.24
apply(jLifeExp, 1, mean)
##  1952  1957  1962  1967  1972  1977  1982  1987  1992  1997  2002  2007 
## 62.66 64.88 66.60 67.67 68.86 70.87 72.61 73.79 75.17 76.36 77.33 78.36
rowMeans(jLifeExp)  #see also rowSums, colMeans, colSums
##  1952  1957  1962  1967  1972  1977  1982  1987  1992  1997  2002  2007 
## 62.66 64.88 66.60 67.67 68.86 70.87 72.61 73.79 75.17 76.36 77.33 78.36
apply(jLifeExp, 2, median)
##        Canada        Mexico United States 
##         74.98         66.22         74.02
jCountries[apply(jLifeExp, 1, which.max)]
##  [1] "Canada" "Canada" "Canada" "Canada" "Canada" "Canada" "Canada"
##  [8] "Canada" "Canada" "Canada" "Canada" "Canada"

Operating on each variable in a data.frame -- a bit awkward because the whole point of data.frames is to hold disparate variables, so the same functions don't always make sense to apply to every variable in a data.frame. But I press on.

sapply(gDat, summary)  # artificial because summary(gDat) achieves same
## $country
##              Afghanistan                  Albania                  Algeria 
##                       12                       12                       12 
##                   Angola                Argentina                Australia 
##                       12                       12                       12 
##                  Austria                  Bahrain               Bangladesh 
##                       12                       12                       12 
##                  Belgium                    Benin                  Bolivia 
##                       12                       12                       12 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                       12                       12                       12 
##                 Bulgaria             Burkina Faso                  Burundi 
##                       12                       12                       12 
##                 Cambodia                 Cameroon                   Canada 
##                       12                       12                       12 
## Central African Republic                     Chad                    Chile 
##                       12                       12                       12 
##                    China                 Colombia                  Comoros 
##                       12                       12                       12 
##         Congo, Dem. Rep.              Congo, Rep.               Costa Rica 
##                       12                       12                       12 
##            Cote d'Ivoire                  Croatia                     Cuba 
##                       12                       12                       12 
##           Czech Republic                  Denmark                 Djibouti 
##                       12                       12                       12 
##       Dominican Republic                  Ecuador                    Egypt 
##                       12                       12                       12 
##              El Salvador        Equatorial Guinea                  Eritrea 
##                       12                       12                       12 
##                 Ethiopia                  Finland                   France 
##                       12                       12                       12 
##                    Gabon                   Gambia                  Germany 
##                       12                       12                       12 
##                    Ghana                   Greece                Guatemala 
##                       12                       12                       12 
##                   Guinea            Guinea-Bissau                    Haiti 
##                       12                       12                       12 
##                 Honduras         Hong Kong, China                  Hungary 
##                       12                       12                       12 
##                  Iceland                    India                Indonesia 
##                       12                       12                       12 
##                     Iran                     Iraq                  Ireland 
##                       12                       12                       12 
##                   Israel                    Italy                  Jamaica 
##                       12                       12                       12 
##                    Japan                   Jordan                    Kenya 
##                       12                       12                       12 
##         Korea, Dem. Rep.              Korea, Rep.                   Kuwait 
##                       12                       12                       12 
##                  Lebanon                  Lesotho                  Liberia 
##                       12                       12                       12 
##                    Libya               Madagascar                   Malawi 
##                       12                       12                       12 
##                 Malaysia                     Mali               Mauritania 
##                       12                       12                       12 
##                Mauritius                   Mexico                 Mongolia 
##                       12                       12                       12 
##               Montenegro                  Morocco               Mozambique 
##                       12                       12                       12 
##                  Myanmar                  Namibia                    Nepal 
##                       12                       12                       12 
##              Netherlands              New Zealand                Nicaragua 
##                       12                       12                       12 
##                    Niger                  Nigeria                   Norway 
##                       12                       12                       12 
##                     Oman                 Pakistan                   Panama 
##                       12                       12                       12 
##                  (Other) 
##                      516 
## 
## $year
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1950    1970    1980    1980    1990    2010 
## 
## $pop
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 6.00e+04 2.79e+06 7.02e+06 2.96e+07 1.96e+07 1.32e+09 
## 
## $continent
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24 
## 
## $lifeExp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    23.6    48.2    60.7    59.5    70.8    82.6 
## 
## $gdpPercap
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     241    1200    3530    7220    9330  114000
sapply(gDat, is.numeric)
##   country      year       pop continent   lifeExp gdpPercap 
##     FALSE      TRUE      TRUE     FALSE      TRUE      TRUE
sapply(gDat, function(x) sum(is.na(x)))
##   country      year       pop continent   lifeExp gdpPercap 
##         0         0         0         0         0         0

I can get a new data.frame holding only the numeric variables.

gDatNum <- subset(gDat, select = sapply(gDat, is.numeric))
str(gDatNum)
## 'data.frame':    1704 obs. of  4 variables:
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Gives a better stage for showing off sapply() and lapply()

sapply(gDatNum, median)
##      year       pop   lifeExp gdpPercap 
## 1.980e+03 7.024e+06 6.071e+01 3.532e+03
lapply(gDatNum, median)
## $year
## [1] 1980
## 
## $pop
## [1] 7023596
## 
## $lifeExp
## [1] 60.71
## 
## $gdpPercap
## [1] 3532

Note that sapply() attempts to tidy up after itself, where as lapply() always returns a list. Notice how plays out when your computing more than one number

sapply(gDatNum, range)
##      year       pop lifeExp gdpPercap
## [1,] 1952 6.001e+04    23.6     241.2
## [2,] 2007 1.319e+09    82.6  113523.1
lapply(gDatNum, range)
## $year
## [1] 1952 2007
## 
## $pop
## [1] 6.001e+04 1.319e+09
## 
## $lifeExp
## [1] 23.6 82.6
## 
## $gdpPercap
## [1]    241.2 113523.1

Let's say we want to get the maximum life expectancy for each continent.

tapply(gDat$lifeExp, gDat$continent, max)  # a drag to type
##   Africa Americas     Asia   Europe  Oceania 
##    76.44    80.65    82.60    81.76    81.23
with(gDat, tapply(lifeExp, continent, max))
##   Africa Americas     Asia   Europe  Oceania 
##    76.44    80.65    82.60    81.76    81.23

The function you want to apply to the life expectancies for each continent can be built-in, like max() above, a custom function you've written, or a custom function specified 'on the fly'. Here's how I would count the number of countries in this dataset for each continent.

with(gDat, tapply(country, continent, function(x) {
    length(unique(x))
}))
##   Africa Americas     Asia   Europe  Oceania 
##       52       25       33       30        2

Unfortunately the output of tapply() often requires tidying.

(rangeLifeExp <- with(gDat, tapply(lifeExp, continent, range)))
## $Africa
## [1] 23.60 76.44
## 
## $Americas
## [1] 37.58 80.65
## 
## $Asia
## [1] 28.8 82.6
## 
## $Europe
## [1] 43.59 81.76
## 
## $Oceania
## [1] 69.12 81.23
str(rangeLifeExp)
## List of 5
##  $ Africa  : num [1:2] 23.6 76.4
##  $ Americas: num [1:2] 37.6 80.7
##  $ Asia    : num [1:2] 28.8 82.6
##  $ Europe  : num [1:2] 43.6 81.8
##  $ Oceania : num [1:2] 69.1 81.2
##  - attr(*, "dim")= int 5
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:5] "Africa" "Americas" "Asia" "Europe" ...

It would be nice to have a matrix (or data.frame) with one row (or column) per continent and one column (or row) for the min life exp and one for max. You can stack up rows and columns to make matrices and data.frames, assuming the type of data and dimensions are tractable.

rbind(rangeLifeExp[[1]], rangeLifeExp[[2]], rangeLifeExp[[3]], rangeLifeExp[[4]], 
    rangeLifeExp[[5]])
##       [,1]  [,2]
## [1,] 23.60 76.44
## [2,] 37.58 80.65
## [3,] 28.80 82.60
## [4,] 43.59 81.76
## [5,] 69.12 81.23

Problem is ...this approach does not scale well at all. Will soon drive you nuts. There is an obscure-sounding but extremely useful alternative.

leByCont <- do.call(rbind, rangeLifeExp)
colnames(leByCont) <- c("min", "max")
leByCont
##            min   max
## Africa   23.60 76.44
## Americas 37.58 80.65
## Asia     28.80 82.60
## Europe   43.59 81.76
## Oceania  69.12 81.23

We might still need to give decent column (variable) names, like "min" and "max", but this is still progress. So lapply() had a tidied up version sapply() but the analogous function does not exist for tapply(). It's up to you and do.call() helps a lot.

This issue of hard-to-predict, less-than-ideal stuff being returned from data aggregation functions is significant. The add-on package plyr addresses this and more. I am just starting to use it now and I recommend you wise up faster than I did and do the same.

Let's try it out. Also gives us a chance to install an add-on package. Here's how to do so at the command line. I save these commands in a script, like I do everything else, so that it's easy for me re-install all the add-on packages I like whenever I update R.

install.packages(pkgs = "plyr")

There are also RStudio-y ways to install packages. Once a package is installed, it still has to be loaded (unless it's one of the automatically loaded base packages). You can make all sorts of persistent changes to your set-up, including which packages are loaded by default, via .Rprofile. For now, we'll just load 'by hand'.

library(plyr)
# library(help = 'plyr')

Now we'll emulate what we did above using the ddply() function from plyr.

ddply(gDat, .(continent), summarise, median = median(lifeExp))
##   continent median
## 1    Africa  47.79
## 2  Americas  67.05
## 3      Asia  61.79
## 4    Europe  72.24
## 5   Oceania  73.66
leByCont2 <- ddply(gDat, .(continent), summarise, min = min(lifeExp), max = max(lifeExp))
leByCont2
##   continent   min   max
## 1    Africa 23.60 76.44
## 2  Americas 37.58 80.65
## 3      Asia 28.80 82.60
## 4    Europe 43.59 81.76
## 5   Oceania 69.12 81.23

I now return to using built-in data aggregation functions, because that's what I know best.

tapply() could only compute an a single variable -- lifeExp in examples above. If you need to work with multiple variables consider by(). Here's how I would do simple linear regression of life expectancy on year for each country. I'm also storing just the estimated parameters, ie intercept and slope. We'll return to this towards the end of the day.

(yearMin <- min(gDat$year))
## [1] 1952
coefEst <- by(gDat, gDat$country, function(cty) {
    coef(lm(lifeExp ~ I(year - yearMin), data = cty))
})
head(coefEst)
## $Afghanistan
##       (Intercept) I(year - yearMin) 
##           29.9073            0.2753 
## 
## $Albania
##       (Intercept) I(year - yearMin) 
##           59.2291            0.3347 
## 
## $Algeria
##       (Intercept) I(year - yearMin) 
##           43.3750            0.5693 
## 
## $Angola
##       (Intercept) I(year - yearMin) 
##           32.1267            0.2093 
## 
## $Argentina
##       (Intercept) I(year - yearMin) 
##           62.6884            0.2317 
## 
## $Australia
##       (Intercept) I(year - yearMin) 
##           68.4005            0.2277

We need to clean up again.

coefEst <- data.frame(do.call(rbind, coefEst))
head(coefEst)
##             X.Intercept. I.year...yearMin.
## Afghanistan        29.91            0.2753
## Albania            59.23            0.3347
## Algeria            43.37            0.5693
## Angola             32.13            0.2093
## Argentina          62.69            0.2317
## Australia          68.40            0.2277

I'd rather have the country names as an actual variable (vs. as row names) and the variable names are awful.

coefEst <- data.frame(country = factor(rownames(coefEst), levels = levels(gDat$country)), 
    coefEst)
names(coefEst) <- c("country", "intercept", "slope")
rownames(coefEst) <- NULL
head(coefEst)
##       country intercept  slope
## 1 Afghanistan     29.91 0.2753
## 2     Albania     59.23 0.3347
## 3     Algeria     43.37 0.5693
## 4      Angola     32.13 0.2093
## 5   Argentina     62.69 0.2317
## 6   Australia     68.40 0.2277

To really polish this off, I wish I had the continent information. match() is very useful for table look-up tasks.

foo <- match(coefEst$country, gDat$country)
head(foo)
## [1]  1 13 25 37 49 61

foo says that the first instance of "Afghanistan" as a country in gDat is in row 1. The first instance of "Albania" in row 13, etc etc. So with those row numbers in hand, I can index the continent variable from gDat to get a continent variable suitable for use in coefEst.

head(gDat$country[foo])
## [1] Afghanistan Albania     Algeria     Angola      Argentina   Australia  
## 142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe

I could add this to coefEst like so:

coefEst$continent <- gDat$continent[foo]

merge() is another function that is useful for these operations.

Data reshaping

Frequently data needs to be reshaped. General from short-and-fat to tall-and-skinny. In my life the most common reasons for this are for modelling and graphing. Consider the matrix we made earlier giving the min and max (variables or columns) life expectancy for each of the 5 continents (rows or observations). Imagine I want to make a figure with min = 0 , max = 1 on the x axis, life expectancy on the yaxis, and connect the two dots for each continent. It turns out this is much easier if the data is rearranged.

leByContTall <- data.frame(lifeExp = as.vector(leByCont), what = factor(rep(c("min", 
    "max"), each = nrow(leByCont)), levels = c("min", "max")), continent = factor(rownames(leByCont)))
leByContTall
##    lifeExp what continent
## 1    23.60  min    Africa
## 2    37.58  min  Americas
## 3    28.80  min      Asia
## 4    43.59  min    Europe
## 5    69.12  min   Oceania
## 6    76.44  max    Africa
## 7    80.65  max  Americas
## 8    82.60  max      Asia
## 9    81.76  max    Europe
## 10   81.23  max   Oceania
str(leByContTall)
## 'data.frame':    10 obs. of  3 variables:
##  $ lifeExp  : num  23.6 37.6 28.8 43.6 69.1 ...
##  $ what     : Factor w/ 2 levels "min","max": 1 1 1 1 1 2 2 2 2 2
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5 1 2 3 ..

Now we can make the plot

library(lattice)
stripplot(lifeExp ~ what, leByContTall)
stripplot(lifeExp ~ what, leByContTall, groups = continent, auto.key = TRUE, 
    grid = TRUE, type = "b")

What I did above was cumbersome but I took full control. I have dabbled with the built-in functions stack() and reshape() and found them more trouble than they were worth. If I abandon my primitive approach, it would be in favor of functions in the add-on package reshape, which is hear good things about. This might be another case, like plyr, where you should so as I say, not as I do.

Exporting data

Now let's say we want to export some data or numerical results. The pseudo-inverse of read.table() is write.table().

# need editing writing the min and max life expectancies by continent to
# file
write.table(leByCont, "minMaxLifeExpByContinent.txt")

## all those quotes drive me nuts! use arguments to take more control

## works best for first version of leByCont, where continents are rownames
write.table(leByCont, "minMaxLifeExpByContinent.txt", quote = FALSE, sep = "\t")

## use sparingly: saving an R object
save(leByCont, file = "leByCont.robj")
rm(leByCont)
ls()
leByCont
load("leByCont.robj")
ls()
leByCont

## dput useful for creating small self-contained examples
dput(leByCont, "leByCont_DPUT.R")
rm(leByCont)
leByCont
(leByCont <- dget("leByCont_DPUT.R"))

## sink useful for highly unstructured output
sink("tTestResults.txt")
t.test(lifeExp ~ year, gDat, subset = year %in% c(1952, 2007))
sink()

Question I got: why would one use dput() instead of save(). My answers:

Good reference: Chapter 8 (“Data Aggregation”) of Spector (2008). This whole book is extremely valuable.