Data aggregation

Optional getting started advice

Ignore if you don't need this bit of support.

This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd() command or by glancing at the top of RStudio's Console pane.

Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "data aggregation".

Load the Gapminder data

Assuming the data can be found in the current working directory, this works:

gDat <- read.delim("gapminderDataFiveYear.txt")

Plan B (I use here, because of where the source of this tutorial lives):

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)

Basic sanity check that the import has gone well:

str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Data aggregation

If you feel the urge to store a little snippet of a data.frame:

(snippet <- subset(gDat, country == "Canada"))
##     country year      pop continent lifeExp gdpPercap
## 241  Canada 1952 14785584  Americas   68.75     11367
## 242  Canada 1957 17010154  Americas   69.96     12490
## 243  Canada 1962 18985849  Americas   71.30     13462
## 244  Canada 1967 20819767  Americas   72.13     16077
## 245  Canada 1972 22284500  Americas   72.88     18971
## 246  Canada 1977 23796400  Americas   74.21     22091
## 247  Canada 1982 25201900  Americas   75.76     22899
## 248  Canada 1987 26549700  Americas   76.86     26627
## 249  Canada 1992 28523502  Americas   77.95     26343
## 250  Canada 1997 30305843  Americas   78.61     28955
## 251  Canada 2002 31902268  Americas   79.77     33329
## 252  Canada 2007 33390141  Americas   80.65     36319

Stop and ask yourself ...

Do I want to create sub-data.frames for each level of some factor (or unique combination of several factors) ... in order to compute or graph something?

If NO, then maybe you really do need to store a copy of a subset of the data.frame. But seriously consider whether you can achieve your goals by simply using the subset = argument -- or perhaps with() coupled with subset() -- to enact a computation on a specific set of rows. If this still does not suit your needs, then maybe you really should use subset() as shown above and carry on.

If YES, use data aggregation techniques or conditioning in lattice or facetting in ggplot2 plots -- don’t subset the data.frame. Or, to be totally clear, only subset the data.frame as a temporary measure as you develop your elegant code for computing on or visualizing these sub-data.frames.

Data aggregation landscape

There are two main options for data aggregation:

I used the built-in functions for many years but am transitioning to plyr. I recommend simply starting with plyr if you are new to R. You can see older material about data aggregation with built-in functions here. In this tutorial we will only use plyr.

You'll notice I did not even mention another option that may occur to some: hand-coding for loops, perhaps, even (shudder) nested for loops! Don't do it. By the end of this tutorial you'll see things that are much faster and more fun. Yes, of course, tedious loops are required for data aggregation but when you can, let other developers write them for you, in super-efficient low level code. This is more about saving programmer time than compute time, BTW.

Install and load plyr

If you have not already done so, you'll need to install plyr. Here's one way to do that:

install.packages("plyr", dependencies = TRUE)

You will also need to load the package before you can use the functions in an R session:

library(plyr)

You can make that apply to all your R sessions via your .Rprofile. Note this is a controversial practice, because it means your code will not necessarily run "as is" on someone else's computer. For that reason, I auto-load packages very sparingly. plyr is not (yet) on my list.

plyr Big Ideas

The plyr functions will not make much sense viewed individually, e.g. simply reading the help for ddply() is not the fast track to competence. There is a very important over-arching logic for the package and it is well worth reading the article The split-apply-combine strategy for data analysis, Hadley Wickham, Journal of Statistical Software, vol. 40, no. 1, pp. 1–29, 2011. Though it is no substitute for reading the above, here is the most critical information:

Today we will emphasize ddply() which accepts a data.frame, splits it into pieces based on one or more factors, computes on the pieces, then returns the results as a data.frame. For the record, the built-in functions most relevant to ddply() are tapply() and friends.

ddply()

Let's say we want to get the maximum life expectancy for each continent.

(maxLeByCont <- ddply(gDat, ~ continent, summarize, maxLifeExp = max(lifeExp)))
##   continent maxLifeExp
## 1    Africa      76.44
## 2  Americas      80.65
## 3      Asia      82.60
## 4    Europe      81.76
## 5   Oceania      81.23

Let's study the return value.

str(maxLeByCont)
## 'data.frame':    5 obs. of  2 variables:
##  $ continent : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##  $ maxLifeExp: num  76.4 80.7 82.6 81.8 81.2
levels(maxLeByCont$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

So we got a data.frame back, with one observation per continent, and two variables: the maximum life expectancies and the continent, as a factor, with the same levels in the same order, as for the input data.frame gDat. If you have sweated to do such things with built-in functions, this minor miracle might make you cry tears of joy (or anguish over all the hours you have wasted.)

summarize() or its synonym summarise() is a function provided by plyr that creates a new data.frame from an old one. It is related to the built-in function transform() that transforms variables in a data.frame or adds new ones. Feel free to play with it a bit in some top-level commands; you will use it alot inside plyr calls.

The two variables in maxLeByCont come from two sources. The continent factor is provided by ddply() and represents the labelling of the life expectancies with their associated continent. This is the book-keeping associated with dividing the input into little bits, computing on them, and gluing the results together again in an orderly, labelled fashion. We can take more credit for the other variable maxLifeExp, which has a name we chose ("maxLifeExp") and arises from applying a function we specified (max()) to a variable of our choice (lifeExp).

You try: compute the minimum GDP per capita by continent. Here's what I get:

##   continent minGdpPercap
## 1    Africa        241.2
## 2  Americas       1201.6
## 3      Asia        331.0
## 4    Europe        973.5
## 5   Oceania      10039.6

You might have chosen a different name for the minimum GDP/capita's, but your numerical results should match.

The function you want to apply to the continent-specific data.frames can be built-in, like max() above, or a custom function you've written. This custom function can be written in advance or specified 'on the fly'. Here's how I would count the number of countries in this dataset for each continent.

ddply(gDat, ~continent, summarize, nUniqCountries = length(unique(country)))
##   continent nUniqCountries
## 1    Africa             52
## 2  Americas             25
## 3      Asia             33
## 4    Europe             30
## 5   Oceania              2

Here is another way to do the same thing that doesn't use summarize() at all:

ddply(gDat, ~ continent,
      function(x) return(c(nUniqCountries = length(unique(x$country)))))
##   continent nUniqCountries
## 1    Africa             52
## 2  Americas             25
## 3      Asia             33
## 4    Europe             30
## 5   Oceania              2

In pseudo pseudo-code, here is what's happening in both of the above commands:

returnValue <- an empty receptacle with one "slot" per country
for each possible country i {
    x  <- subset(gDat, subset = country == i)
    returnValue[i] <- length(unique(x$country))
    name or label for returnValue[i] is set to country i
}
ddply packages returnValue and associate names/labels as a nice data.frame

You don't have to compute just one thing for each sub-data.frame, nor are you limited to computing on just one variable. Check it out.

ddply(gDat, ~ continent, summarize,
      minLifeExp = min(lifeExp), maxLifeExp = max(lifeExp),
      medGdpPercap = median(gdpPercap))
##   continent minLifeExp maxLifeExp medGdpPercap
## 1    Africa      23.60      76.44         1192
## 2  Americas      37.58      80.65         5466
## 3      Asia      28.80      82.60         2647
## 4    Europe      43.59      81.76        12082
## 5   Oceania      69.12      81.23        17983

Putting it all together: using ddply() and polishing the results

Now I want to do something more complicated. I want to fit a linear regression for each country, modelling life expectancy as a function of the year and then retain the estimated intercepts and slopes. I will walk before I run. Therefore, I will create a tiny sub-data.frame to prototype this, before I fold it into a ddply() call. If you're a newbie, watch how complicated tasks are slowly constructed.

jCountry <- "France"  # pick, but do not hard wire, an example
(jDat <- subset(gDat, country == jCountry))  # temporary measure!
##     country year      pop continent lifeExp gdpPercap
## 529  France 1952 42459667    Europe   67.41      7030
## 530  France 1957 44310863    Europe   68.93      8663
## 531  France 1962 47124000    Europe   70.51     10560
## 532  France 1967 49569000    Europe   71.55     13000
## 533  France 1972 51732000    Europe   72.38     16107
## 534  France 1977 53165019    Europe   73.83     18293
## 535  France 1982 54433565    Europe   74.89     20294
## 536  France 1987 55630100    Europe   76.34     22066
## 537  France 1992 57374179    Europe   77.46     24704
## 538  France 1997 58623428    Europe   78.64     25890
## 539  France 2002 59925035    Europe   79.59     28926
## 540  France 2007 61083916    Europe   80.66     30470
xyplot(lifeExp ~ year, jDat, type = c("p", "r"))  # always plot the data
jFit <- lm(lifeExp ~ year, jDat)
summary(jFit)
## 
## Call:
## lm(formula = lifeExp ~ year, data = jDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3801 -0.1389  0.0124  0.1429  0.3349 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.98e+02   7.29e+00   -54.6  1.0e-13 ***
## year         2.38e-01   3.68e-03    64.8  1.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.22 on 10 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.997 
## F-statistic: 4.2e+03 on 1 and 10 DF,  p-value: 1.86e-14

Wow, check out that crazy intercept! Apparently the life expectancy in France around year 0 A.D. was minus 400 years! This a great opportunity for some sanity checking of a model fit and thinking about how to reparametrize the model to make the parameters have natural interpretation. I think it makes more sense for the intercept to correspond to life expectancy in 1952, the earliest date in our dataset. Let's try that again.

(yearMin <- min(gDat$year))
## [1] 1952
jFit <- lm(lifeExp ~ I(year - yearMin), jDat)
summary(jFit)
## 
## Call:
## lm(formula = lifeExp ~ I(year - yearMin), data = jDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3801 -0.1389  0.0124  0.1429  0.3349 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       67.79013    0.11949   567.3  < 2e-16 ***
## I(year - yearMin)  0.23850    0.00368    64.8  1.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.22 on 10 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.997 
## F-statistic: 4.2e+03 on 1 and 10 DF,  p-value: 1.86e-14

An intercept around 68 years makes much more common sense and is also supported by our plot. What is this jFit object and how can I get stuff out of it?

class(jFit)
## [1] "lm"
mode(jFit)
## [1] "list"

It turns out jFit is of class "lm" and its mode is list. So that means I could use indexing to isolate specific components. But what's in there?

## str(jFit) # too ugly to print here but you should look
names(jFit)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
jFit$coefficients
##       (Intercept) I(year - yearMin) 
##           67.7901            0.2385

Using str() and names() reveals a great deal about this "lm" object and reading the help file for lm() would explain a great deal more. In the See Also section we learn there's a generic function coef() which looks promising.

coef(jFit)
##       (Intercept) I(year - yearMin) 
##           67.7901            0.2385

As a rule, I use extractor functions like this when they are available.

We have achieved our goal for this specific country -- we've gotten its intercept and slope. Now we need to package that as a function (we will talk about functions properly later, but this should be fairly self-explanatory).

jFun <- function(x) coef(lm(lifeExp ~ I(year - yearMin), x))
jFun(jDat)  # trying out our new function ... yes still get same numbers
##       (Intercept) I(year - yearMin) 
##           67.7901            0.2385

I hate the names of these return values. Good names pay off downstream, so I will enhance my function.

jFun <- function(x) {
    estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
    names(estCoefs) <- c("intercept", "slope")
    return(estCoefs)
}
jFun(jDat)  # trying out our improved function ... yes still get same numbers
## intercept     slope 
##   67.7901    0.2385

It's always a good idea to try out a function on a few small examples.

jFun(subset(gDat, country == "Canada"))
## intercept     slope 
##   68.8838    0.2189
jFun(subset(gDat, country == "Uruguay"))
## intercept     slope 
##   65.7416    0.1833
jFun(subset(gDat, country == "India"))
## intercept     slope 
##   39.2698    0.5053

It seems like we are ready to scale up by placing this function inside a ddply() call.

jCoefs <- ddply(gDat, ~country, jFun)
str(jCoefs)
## 'data.frame':    142 obs. of  3 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10..
##  $ intercept: num  29.9 59.2 43.4 32.1 62.7 ...
##  $ slope    : num  0.275 0.335 0.569 0.209 0.232 ...
tail(jCoefs)
##                country intercept    slope
## 137          Venezuela     57.51  0.32972
## 138            Vietnam     39.01  0.67162
## 139 West Bank and Gaza     43.80  0.60110
## 140        Yemen, Rep.     30.13  0.60546
## 141             Zambia     47.66 -0.06043
## 142           Zimbabwe     55.22 -0.09302

We did it! By the time we've packaged the computation in a function, the call itself is deceptively simple. To review, here's the script I would save from our work in this section:

## realistically, you would read the data from a local file
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
## str(gDat) here when working interactively
yearMin <- min(gDat$year)
jFun <- function(x) {
    estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
    names(estCoefs) <- c("intercept", "slope")
    return(estCoefs)
}
## jFun(subset(gDat, country == 'India')) to see what it does
jCoefs <- ddply(gDat, ~country, jFun)

Over the course of its development, the number of lines in the script would start small, get bigger as I fiddle around, make mistakes, write and use lots of sanity checking code and then .... contract down to the above, as I strip it down to bare necessities. That is how the pros actually work. They don't write beautiful elegant scripts the first time and they aren't satisfied with hideous hack-y "transcripts" of the very first attempt that (sort of) worked.

Finally, let's present this information attractively in a table. I consider my advice about how to do this less definitive than the above. I'll be happy to see people explore other table-making tools. Fixed width plain text printing of data.frames is OK for internal use and during development. But at some point you will want a nicer looking table. Markdown doesn't have a proper table syntax, because it is a ruthlessly simple language. Some Markdown dialects / processing engines support table extensions, BTW. But here I will teach you how to make an HTML table. This HTML code will survive unharmed as your R Markdown document is converted from R Markdown to Markdown and finally to HTML.

I've experimented with the xtable package, which you will need to install

install.packages("xtable", dependencies = TRUE)

and load.

library(xtable)

Let's pick some countries at random and display their estimated coefficients. FYI: the following R chunk has an option results='asis' and that is important to the correct display of the table.

set.seed(916)
foo <- jCoefs[sample(nrow(jCoefs), size = 15), ]
foo <- xtable(foo)
print(foo, type = "html", include.rownames = FALSE)
country intercept slope
Lebanon 58.69 0.26
Senegal 36.75 0.50
Dominican Republic 48.60 0.47
Oman 37.21 0.77
Germany 67.57 0.21
Korea, Dem. Rep. 54.91 0.32
Mauritius 55.37 0.35
Slovak Republic 67.01 0.13
Comoros 40.00 0.45
Argentina 62.69 0.23
Central African Republic 38.81 0.18
Ecuador 49.07 0.50
West Bank and Gaza 43.80 0.60
Egypt 40.97 0.56
Myanmar 41.41 0.43

Two easy improvments to make this table more useful are

The easiest way to get the continent information is to enhance our ddply() call. Here is what we used:

jCoefs <- ddply(gDat, ~country, jFun)

This divides gDat into country-specific pieces. But we can supply two factors in the second argument: country and continent. In theory, sub-data.frames will be made for all possible combinations of the levels of country and continent. Many of those will have zero rows because there is, for example, no Belgium in Asia. By default, plyr functions drop these empty combinations. But the labelling work done by ddply() will still help us, as we will get both country and continent as factors in our result. This is easier to see than explain!

jCoefs <- ddply(gDat, ~country + continent, jFun)
str(jCoefs)
## 'data.frame':    142 obs. of  4 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10..
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 ..
##  $ intercept: num  29.9 59.2 43.4 32.1 62.7 ...
##  $ slope    : num  0.275 0.335 0.569 0.209 0.232 ...
tail(jCoefs)
##                country continent intercept    slope
## 137          Venezuela  Americas     57.51  0.32972
## 138            Vietnam      Asia     39.01  0.67162
## 139 West Bank and Gaza      Asia     43.80  0.60110
## 140        Yemen, Rep.      Asia     30.13  0.60546
## 141             Zambia    Africa     47.66 -0.06043
## 142           Zimbabwe    Africa     55.22 -0.09302

Now, prior to making the HTML table, we will sort the data.frame, so it starts with the country with the shortest life expectancy in 1952, and goes to the largest.

set.seed(916)
foo <- jCoefs[sample(nrow(jCoefs), size = 15), ]
foo <- arrange(foo, intercept)
## foo <- foo[order(foo$intercept), ] # an uglier non-plyr way
foo <- xtable(foo)
print(foo, type = "html", include.rownames = FALSE)
country continent intercept slope
Senegal Africa 36.75 0.50
Oman Asia 37.21 0.77
Central African Republic Africa 38.81 0.18
Comoros Africa 40.00 0.45
Egypt Africa 40.97 0.56
Myanmar Asia 41.41 0.43
West Bank and Gaza Asia 43.80 0.60
Dominican Republic Americas 48.60 0.47
Ecuador Americas 49.07 0.50
Korea, Dem. Rep. Asia 54.91 0.32
Mauritius Africa 55.37 0.35
Lebanon Asia 58.69 0.26
Argentina Americas 62.69 0.23
Slovak Republic Europe 67.01 0.13
Germany Europe 67.57 0.21

Soon we will start making the companion plots ... but for now our work is done.

Lessons

plyr is a powerful package for data aggregation.

ddply() is the most important function: data.frames are great, therefore ddply() is great because it takes data.frame as input and returns data.frame as output.

Simple functions, built-in or user-defined, can be provided directly in a call to ddply().

It's better to handle more complicated data aggregation differently. Build it up slowly and revisit earlier steps for improvement. Here is a gentle workflow:

Results you present to the world generally need polishing to make the best impression and have the best impact. This too will be an iterative process. Here are some good things to think about:

Exporting data

What it you wanted to write the table of slopes and intercepts to file? Go to the tutorial on getting data out of R, but if this is an emergency, the main functions to consider are write.table() and saveRDS().

References

plyr paper: The split-apply-combine strategy for data analysis, Hadley Wickham, Journal of Statistical Software, vol. 40, no. 1, pp. 1–29, 2011. Go here for supplements, such as example code from the paper.

Data Manipulation with R by Phil Spector, Springer (2008) | author webpage | GoogleBooks search

Q & A

Student: How do you pass more than one argument for a function into ddply(). The main example that we used in class was this:

(yearMin <- min(gDat$year))
## [1] 1952
jFun <- function(x) {
    estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
    names(estCoefs) <- c("intercept", "slope")
    return(estCoefs)
}
jCoefs <- ddply(gDat, ~country, jFun)
head(jCoefs)
##       country intercept  slope
## 1 Afghanistan     29.91 0.2753
## 2     Albania     59.23 0.3347
## 3     Algeria     43.37 0.5693
## 4      Angola     32.13 0.2093
## 5   Argentina     62.69 0.2317
## 6   Australia     68.40 0.2277

and jFun only requires one argument, x. What if it had more than one argument?

Answer: Let's imagine that the shift for the year covariate is an argument instead of a previously-assigned variable yearMin. Here's how it would work.

jFunTwoArgs <- function(x, cvShift = 0) {
    estCoefs <- coef(lm(lifeExp ~ I(year - cvShift), x))
    names(estCoefs) <- c("intercept", "slope")
    return(estCoefs)
}

Since I've assigned cvShift = a default value of zero, we can get coefficients where the intercept corresponds to the year A.D. 0 with this simple call:

jCoefsSilly <- ddply(gDat, ~country, jFunTwoArgs)
head(jCoefsSilly)
##       country intercept  slope
## 1 Afghanistan    -507.5 0.2753
## 2     Albania    -594.1 0.3347
## 3     Algeria   -1067.9 0.5693
## 4      Angola    -376.5 0.2093
## 5   Argentina    -389.6 0.2317
## 6   Australia    -376.1 0.2277

We are getting the same estimated slopes but the silly year 0 intercepts we've seen before. Let's use the cvShift = argument to resolve this.

jCoefsSane <- ddply(gDat, ~country, jFunTwoArgs, cvShift = 1952)
head(jCoefsSane)
##       country intercept  slope
## 1 Afghanistan     29.91 0.2753
## 2     Albania     59.23 0.3347
## 3     Algeria     43.37 0.5693
## 4      Angola     32.13 0.2093
## 5   Argentina     62.69 0.2317
## 6   Australia     68.40 0.2277

We're back to our usual estimated intercepts, which reflect life expectancy in 1952. Of course hard-wiring 1952 is not a great idea, so here's probably our best code yet:

jCoefsBest <- ddply(gDat, ~country, jFunTwoArgs, cvShift = min(gDat$year))
head(jCoefsBest)
##       country intercept  slope
## 1 Afghanistan     29.91 0.2753
## 2     Albania     59.23 0.3347
## 3     Algeria     43.37 0.5693
## 4      Angola     32.13 0.2093
## 5   Argentina     62.69 0.2317
## 6   Australia     68.40 0.2277