Go back to STAT545A home

Homework #3 Data aggregation

Follow the existing homework submission instructions UP TO THE LINK SUBMISSION PART. Submit your links by editing this Google doc -- WARNING: I turned off sharing ~3pm Mon Sept 23 because almost everyone had submitted their links. Contact JB if you need to submit or edit a link.

Tips

Load the Gapminder data and the plyr and xtable packages

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(plyr)
library(xtable)
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
## define a function for converting and printing to HTML table
htmlPrint <- function(x, ...,
                      digits = 0, include.rownames = FALSE) {
  print(xtable(x, digits = digits, ...), type = 'html',
        include.rownames = include.rownames, ...)
  }

I wrote a function htmlPrint() to create an HTML table and print it.

Create your own adventure.

You are welcome to create your own exercises by riffing on what you see below. In fact, that's great -- you won't be copying from your office mate and it emulates real life where you first need to think up interesting questions, then solve them. The easiest sort of variation is to do what I ask below but with a different variable, different summary statistic, different tuning parameter, etc.

Get the maximum and minimum of GDP per capita for all continents in a "wide" format. Easy.

Produce a data.frame with 3 variables: a continent factor, the minimum GDP per capita, the maximum GDP per capita. There will be one row per continent. Sort on either the minimum or the maximum and assess whether the continents rank similarly w/r/t the other extreme. Discuss.

foo <- ddply(gDat, ~ continent, summarize,
             minGdpPercap = min(gdpPercap), maxGdpPercap = max(gdpPercap))
htmlPrint(arrange(foo, minGdpPercap))
continent minGdpPercap maxGdpPercap
Africa 241 21951
Asia 331 113523
Europe 974 49357
Americas 1202 42952
Oceania 10040 34435

Get the maximum and minimum of GDP per capita for all continents in a "tall" format. Medium/hard.

Produce a data.frame with 3 variables: a continent factor, a GDP per capita summary statistic, a factor that describes the statistic (i.e. "min" vs "max"). There will be one row per continent * statistic combination. Don't use reshaping -- get this directly with data aggregation.

foo <- ddply(gDat, ~ continent, function(x) {
  gdpPercap <- range(x$gdpPercap)
  return(data.frame(gdpPercap, stat = c("min", "max")))
  })
htmlPrint(foo)
continent gdpPercap stat
Africa 241 min
Africa 21951 max
Americas 1202 min
Americas 42952 max
Asia 331 min
Asia 113523 max
Europe 974 min
Europe 49357 max
Oceania 10040 min
Oceania 34435 max

Look at the spread of GDP per capita within the continents. Easy.

Measures of spread (we will discuss in class soon, just try them out!):

Produce a data.frame with one row per continent and variables containing the different measures of spread for GDP/capita. Sort on one of those and assess whether the continents rank similarly w/r/t heterogeneity of GDP/capita by the other measures. Discuss.

foo <- ddply(gDat, ~ continent, summarize,
             sdGdpPercap = sd(gdpPercap), madGdpPercap = mad(gdpPercap),
             iqrGdpPercap = IQR(gdpPercap))
htmlPrint(arrange(foo, sdGdpPercap))
continent sdGdpPercap madGdpPercap iqrGdpPercap
Africa 2828 775 1616
Oceania 6359 6459 8072
Americas 6397 3269 4402
Europe 9355 8846 13248
Asia 14045 2821 7492
## spot checks!
IQR(gDat$gdpPercap[gDat$continent == "Europe"])

[1] 13248

mad(gDat$gdpPercap[gDat$continent == "Asia"])

[1] 2821

If you don't get an error but you are still worried you aren't doing the right computation, pick a few cases in the middle at random and check "by hand".

Compute a trimmed mean of life expectancy for different years. Easy/medium.

Compute the average life expectancy by year, as a warm-up exercise. Then consider an alternative: a trimmed mean. There is a trim argument in the built-in function mean(). Pick a modest trim fraction that is greater than 0. Report your sorted results. Make sure your trim level is well-documented.

jTrim <- 0.2
foo <- ddply(gDat, ~ year, summarize, tMean = mean(lifeExp, trim = jTrim))
## Error: object 'jTrim' not found
htmlPrint(arrange(foo, tMean))
## Error: object 'tMean' not found

Hear ye, hear ye: the trim level for the trimmed means above is 0.2. Let it be known that I did not type that number in; rather it was printed from the value assigned in the previous R chunk.

If you haven't used it yet, you should experiment with the backtick method of including inline R code. Look at the R Markdown file to see how I do this. If I change jTrim to a different value in the R code, my prose will automatically update.

How is life expectancy changing over time on different continents? "Tall" format. Easy/medium.

Get a data.frame with 3 variables: continent, year, and mean or median life expectancy. What are the trends over time? Are they the same or different across the continents? Is this easy to assess from this data.frame?

htmlPrint(ddply(gDat, ~ continent + year,
                summarize, medLifeExp = median(lifeExp)))
continent year medLifeExp
Africa 1952 39
Africa 1957 41
Africa 1962 43
Africa 1967 45
Africa 1972 47
Africa 1977 49
Africa 1982 51
Africa 1987 52
Africa 1992 52
Africa 1997 53
Africa 2002 51
Africa 2007 53
Americas 1952 55
Americas 1957 56
Americas 1962 58
Americas 1967 61
Americas 1972 63
Americas 1977 66
Americas 1982 67
Americas 1987 69
Americas 1992 70
Americas 1997 72
Americas 2002 72
Americas 2007 73
Asia 1952 45
Asia 1957 48
Asia 1962 49
Asia 1967 54
Asia 1972 57
Asia 1977 61
Asia 1982 64
Asia 1987 66
Asia 1992 69
Asia 1997 70
Asia 2002 71
Asia 2007 72
Europe 1952 66
Europe 1957 68
Europe 1962 70
Europe 1967 71
Europe 1972 71
Europe 1977 72
Europe 1982 73
Europe 1987 75
Europe 1992 75
Europe 1997 76
Europe 2002 78
Europe 2007 79
Oceania 1952 69
Oceania 1957 70
Oceania 1962 71
Oceania 1967 71
Oceania 1972 72
Oceania 1977 73
Oceania 1982 74
Oceania 1987 75
Oceania 1992 77
Oceania 1997 78
Oceania 2002 80
Oceania 2007 81

How is life expectancy changing over time on different continents? "Wide" format. Hard (or possibly silly).

Get a data.frame with 1 + 5 = 6 variables: year and then one variable per continent, giving mean or median life expectancy. Is there a natural way to do this in plyr without hard-wiring the continent-specific variables and without reshaping an intermediate result? Since I am new-ish to plyr, I don't know yet if this is possible and/or a totally stupid way to go about this. So this one is for people who are happy to play around a bit. Hint: I've made some progress here and advise you look into daply().

foo <- daply(gDat, ~ year + continent,
             summarize, medLifeExp = median(lifeExp))
#str(foo)
foo <- as.data.frame(foo)
#str(foo) ## there's still some goofiness, ie the variables are lists .... hmmmm
htmlPrint(foo, include.rownames = TRUE)
Africa Americas Asia Europe Oceania
1952 39 55 45 66 69
1957 41 56 48 68 70
1962 43 58 49 70 71
1967 45 61 54 71 71
1972 47 63 57 71 72
1977 49 66 61 72 73
1982 51 67 64 73 74
1987 52 69 66 75 75
1992 52 70 69 75 77
1997 53 72 70 76 78
2002 51 72 71 78 80
2007 53 73 72 79 81

Arguably, this table is better transposed. So just switch the order of continent and year in the plyr call.

foo <- daply(gDat, ~ continent + year,
             summarize, medLifeExp = median(lifeExp))
htmlPrint(as.data.frame(foo), include.rownames = TRUE)
1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Africa 39 41 43 45 47 49 51 52 52 53 51 53
Americas 55 56 58 61 63 66 67 69 70 72 72 73
Asia 45 48 49 54 57 61 64 66 69 70 71 72
Europe 66 68 70 71 71 72 73 75 75 76 78 79
Oceania 69 70 71 71 72 73 74 75 77 78 80 81

Count the number of countries with low life expectancy over time by continent. "Tall" format. Medium/hard.

Compute some measure of worldwide life expectancy -- you decide -- a mean or median or some other quantile or perhaps your current age. The determine how many countries on each continent have a life expectancy less than this benchmark, for each year. Produce a data.frame with 3 variables: continent, year, and a country count. What's the trend over time? Is that trend the same for all continents? How much fun is it to assess this from this tall table?

#(bMark <- quantile(gDat$lifeExp, 0.2))
bMark <- 43 # JB wrote this on her 43rd b'day!
htmlPrint(ddply(gDat, ~ continent + year,
                function(x) c(lowLifeExp = sum(x$lifeExp <= bMark))))
continent year lowLifeExp
Africa 1952 43
Africa 1957 34
Africa 1962 28
Africa 1967 20
Africa 1972 13
Africa 1977 10
Africa 1982 7
Africa 1987 4
Africa 1992 5
Africa 1997 6
Africa 2002 4
Africa 2007 6
Americas 1952 5
Americas 1957 2
Americas 1962 0
Americas 1967 0
Americas 1972 0
Americas 1977 0
Americas 1982 0
Americas 1987 0
Americas 1992 0
Americas 1997 0
Americas 2002 0
Americas 2007 0
Asia 1952 12
Asia 1957 11
Asia 1962 5
Asia 1967 3
Asia 1972 3
Asia 1977 2
Asia 1982 1
Asia 1987 1
Asia 1992 1
Asia 1997 1
Asia 2002 1
Asia 2007 0
Europe 1952 0
Europe 1957 0
Europe 1962 0
Europe 1967 0
Europe 1972 0
Europe 1977 0
Europe 1982 0
Europe 1987 0
Europe 1992 0
Europe 1997 0
Europe 2002 0
Europe 2007 0
Oceania 1952 0
Oceania 1957 0
Oceania 1962 0
Oceania 1967 0
Oceania 1972 0
Oceania 1977 0
Oceania 1982 0
Oceania 1987 0
Oceania 1992 0
Oceania 1997 0
Oceania 2002 0
Oceania 2007 0

Compute the proportion of countries with low life expectancy over time by continent. Medium extension of another task.

The continents have very different numbers of countries. Maybe the proportion of "low life expectancy" countries is more relevant than the raw count. Or maybe the final table should report both? Work on that.

## note: bMark set in previous chunk
htmlPrint(ddply(gDat, ~ continent + year, function(x) {
  jCount = sum(x$lifeExp <= bMark)
  c(count = jCount, prop = jCount / nrow(x))
}), digits = c(0, 0, 0, 0, 2))
continent year count prop
Africa 1952 43 0.83
Africa 1957 34 0.65
Africa 1962 28 0.54
Africa 1967 20 0.38
Africa 1972 13 0.25
Africa 1977 10 0.19
Africa 1982 7 0.13
Africa 1987 4 0.08
Africa 1992 5 0.10
Africa 1997 6 0.12
Africa 2002 4 0.08
Africa 2007 6 0.12
Americas 1952 5 0.20
Americas 1957 2 0.08
Americas 1962 0 0.00
Americas 1967 0 0.00
Americas 1972 0 0.00
Americas 1977 0 0.00
Americas 1982 0 0.00
Americas 1987 0 0.00
Americas 1992 0 0.00
Americas 1997 0 0.00
Americas 2002 0 0.00
Americas 2007 0 0.00
Asia 1952 12 0.36
Asia 1957 11 0.33
Asia 1962 5 0.15
Asia 1967 3 0.09
Asia 1972 3 0.09
Asia 1977 2 0.06
Asia 1982 1 0.03
Asia 1987 1 0.03
Asia 1992 1 0.03
Asia 1997 1 0.03
Asia 2002 1 0.03
Asia 2007 0 0.00
Europe 1952 0 0.00
Europe 1957 0 0.00
Europe 1962 0 0.00
Europe 1967 0 0.00
Europe 1972 0 0.00
Europe 1977 0 0.00
Europe 1982 0 0.00
Europe 1987 0 0.00
Europe 1992 0 0.00
Europe 1997 0 0.00
Europe 2002 0 0.00
Europe 2007 0 0.00
Oceania 1952 0 0.00
Oceania 1957 0 0.00
Oceania 1962 0 0.00
Oceania 1967 0 0.00
Oceania 1972 0 0.00
Oceania 1977 0 0.00
Oceania 1982 0 0.00
Oceania 1987 0 0.00
Oceania 1992 0 0.00
Oceania 1997 0 0.00
Oceania 2002 0 0.00
Oceania 2007 0 0.00

Report the absolute and/or relative abundance of countries with low life expectancy over time by continent. "Wide" format. Extension of another task of unknown difficulty / value.

Create a table with one row per year and one column -- at least conceptually -- per continent. In that "column" report the absolute and/or relative abundance of "low life expectancy" countries. Is there a natural way to do this in plyr without hard-wiring the continent-specific variables and without reshaping an intermediate result? Since I am new-ish to plyr, I don't know yet if this is possible and/or a totally stupid way to go about this. So this one is for people who are happy to play around a bit. Hint: I've made some progress here and advise you look into daply(). I ended up skipping straight to character output versus an object appropriate for further computation or graphing. Sometimes it's better to approach these ill-defined tasks from a different angle.

## note: bMark set in a previous chunk

## my first success ... but was still a 3-dim array :(
## that would be useful for plotting, at least after some reshaping, but is not
## so easy on the eyes
# daply(gDat, ~ year + continent, function(x) {
#   jCount <- sum(x$lifeExp < bMark)
#   c(count = jCount, prop = jCount / nrow(x))
# })

foo <- daply(gDat, ~ year + continent, function(x) {
  jCount <- sum(x$lifeExp <= bMark)
  jTotal <- nrow(x)
  jProp <- jCount / jTotal
  return(sprintf("%1.2f (%d/%d)", jProp, jCount, jTotal))
  })
htmlPrint(foo, include.rownames = TRUE)
Africa Americas Asia Europe Oceania
1952 0.83 (43/52) 0.20 (5/25) 0.36 (12/33) 0.00 (0/30) 0.00 (0/2)
1957 0.65 (34/52) 0.08 (2/25) 0.33 (11/33) 0.00 (0/30) 0.00 (0/2)
1962 0.54 (28/52) 0.00 (0/25) 0.15 (5/33) 0.00 (0/30) 0.00 (0/2)
1967 0.38 (20/52) 0.00 (0/25) 0.09 (3/33) 0.00 (0/30) 0.00 (0/2)
1972 0.25 (13/52) 0.00 (0/25) 0.09 (3/33) 0.00 (0/30) 0.00 (0/2)
1977 0.19 (10/52) 0.00 (0/25) 0.06 (2/33) 0.00 (0/30) 0.00 (0/2)
1982 0.13 (7/52) 0.00 (0/25) 0.03 (1/33) 0.00 (0/30) 0.00 (0/2)
1987 0.08 (4/52) 0.00 (0/25) 0.03 (1/33) 0.00 (0/30) 0.00 (0/2)
1992 0.10 (5/52) 0.00 (0/25) 0.03 (1/33) 0.00 (0/30) 0.00 (0/2)
1997 0.12 (6/52) 0.00 (0/25) 0.03 (1/33) 0.00 (0/30) 0.00 (0/2)
2002 0.08 (4/52) 0.00 (0/25) 0.03 (1/33) 0.00 (0/30) 0.00 (0/2)
2007 0.12 (6/52) 0.00 (0/25) 0.00 (0/33) 0.00 (0/30) 0.00 (0/2)

Find countries with interesting stories. Open-ended and, therefore, hard. Realistic!

It was easy (see above) to get the minimum life expectancy seen on each continent by year. Now try to report the year, the continent, the minimum (or max or whatever) life expectancy AND the country it pertains to. You could do something similar with GDP per capita or population (though result likely to be boring with population). Hint: read up on which.min() and which.max().

## this works but produces a frustrating long/tall result; not good for printing
# ddply(gDat, ~ year + continent, function(x) {
#   theMin <- which.min(x$lifeExp)
#   x[theMin, c("country", "year", "continent", "lifeExp")]
#   })

foo <- daply(gDat, ~ year + continent, function(x) {
  theMin <- which.min(x$lifeExp)
  return(sprintf("%1.0f %s", x$lifeExp[theMin], x$country[theMin]))
})
htmlPrint(foo, include.rownames = TRUE)
Africa Americas Asia Europe Oceania
1952 30 Gambia 38 Haiti 29 Afghanistan 44 Turkey 69 Australia
1957 32 Sierra Leone 41 Haiti 30 Afghanistan 48 Turkey 70 New Zealand
1962 33 Sierra Leone 43 Bolivia 32 Afghanistan 52 Turkey 71 Australia
1967 34 Sierra Leone 45 Bolivia 34 Afghanistan 54 Turkey 71 Australia
1972 35 Sierra Leone 47 Bolivia 36 Afghanistan 57 Turkey 72 New Zealand
1977 37 Sierra Leone 50 Haiti 31 Cambodia 60 Turkey 72 New Zealand
1982 38 Sierra Leone 51 Haiti 40 Afghanistan 61 Turkey 74 New Zealand
1987 40 Angola 54 Haiti 41 Afghanistan 63 Turkey 74 New Zealand
1992 24 Rwanda 55 Haiti 42 Afghanistan 66 Turkey 76 New Zealand
1997 36 Rwanda 57 Haiti 42 Afghanistan 69 Turkey 78 New Zealand
2002 39 Zambia 58 Haiti 42 Afghanistan 71 Turkey 79 New Zealand
2007 40 Swaziland 61 Haiti 44 Afghanistan 72 Turkey 80 New Zealand

Consider the linear regression we fit in tutorial of life expectancy vs. time. Find the min (and/or max) of the slope (and/or the intercept) within each continent. Report these interesting countries and some info about them in a data.frame. This might work for other variables too.

yearMin <- min(gDat$year)
jFun <- function(x) {
  estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
  names(estCoefs) <- c("intercept", "slope")
  return(estCoefs)
  }
jCoefs <- ddply(gDat, ~ country + continent, jFun)
## giving out the Most Improved Award!
## going after maximum slope within continent
foo <- ddply(jCoefs, ~ continent, function(x) {
  theMax <- which.max(x$slope)
  x[theMax, c("continent", "country", "intercept", "slope")]
  })
htmlPrint(foo, digits = c(0, 0, 0, 0, 2))
continent country intercept slope
Africa Libya 42 0.63
Americas Nicaragua 43 0.56
Asia Oman 37 0.77
Europe Turkey 46 0.50
Oceania Australia 68 0.23

Sudden, substantial departures from the temporal trend is interesting. This goes for life expectancy, GDP per capita, or population. How might one operationalize this notion of "interesting"?

Make a data.frame of interesting countries, reporting their continent affiliation, and summary statistics on life expectancy (or GDP/capita or whatever you were looking at). Won't it be interesting to start to graphically explore them ... next week.

yearMin <- min(gDat$year)
jFun <- function(x) {
  jFit <- lm(lifeExp ~ I(year - yearMin), x)
  jCoef <- coef(jFit)
  names(jCoef) <- NULL
  return(c(intercept = jCoef[1],
           slope = jCoef[2],
           maxResid = max(abs(resid(jFit)))/summary(jFit)$sigma))
  }
maxResids <- ddply(gDat, ~ country + continent, jFun)
foo <- ddply(maxResids, ~ continent, function(x) {
  theMax <- which.max(x$maxResid)
  x[theMax, ]
  })
htmlPrint(foo, digits = c(0, 0, 0, 2, 2, 2))
country continent intercept slope maxResid
Rwanda Africa 42.74 -0.05 2.64
Ecuador Americas 49.07 0.50 2.25
Cambodia Asia 37.02 0.40 2.79
Finland Europe 66.45 0.24 2.72
Australia Oceania 68.40 0.23 1.65
## I can't resist, I must make a plot
xyplot(lifeExp ~ year | country, gDat,
       subset = country %in% foo$country, type = c("p", "r"))

Discuss: Is Ecuador really interesting? Do you think it's the most interesting country in the Americas? I doubt it. What could we do better? Making plots is ABSOLUTELY ESSENTIAL to sanity checking your claims and beliefs based on numerical computation.

Student work

Q and A

Student: consider the exercises where we return info on the country that exhibits, e.g., the minimum life expectancy for a certain continent and/or year. What if there are ties and that minimum is achieved by more than 1 country?

I consider this a special case or extension of something we did above: identifying "low life expectancy" countries. So I will mix that code with our min/max finding code. BTW I cannot find an instance where any countries tie for lowest or longest life expectancy, but my two stage solution reflects my attempt to look into that. I created a copy of gDat and altered two life expectancies to create a tie! No, people are not usually living past 110 in Norway and Sweden!

## I couldn't find any examples of two or more countries sharing the min or max
## life expectancy in a given year; so I tweaked the data!

testDat <- gDat
toReplace <- with(testDat, country %in% c("Norway", "Sweden") & year == 2007)
testDat[toReplace, "lifeExp"] <- 110

## for each life expectancy, compute its rank within all life expectancies
## recorded that year
ggDat <- ddply(testDat, ~ year, transform,
               leRank = rank(lifeExp, ties.method = "average"))
## look at the new leRank variable, if you wish
#table(ggDat$leRank)

## pull out the countries with extreme ranks ... with min and max being the most
## extreme of the extremes!
foo <- ddply(ggDat, ~ year, function(x) {
  minRank <- min(x$leRank) # you could change this to find the lowest, e.g. 2
  maxRank <- max(x$leRank) # and the highest 2
  whoMin <- which(x$leRank == minRank) # I allow this to have length > 1
  whoMax <- which(x$leRank == maxRank)
  data.frame(year = x$year[whoMin[1]],
             minLifeExp = x$lifeExp[whoMin[1]],
             minCountries = paste(x$country[whoMin], collapse = " "),
             minRank = minRank,
             maxLifeExp = x$lifeExp[whoMax[1]],
             maxCountries = paste(x$country[whoMax], collapse = " "),
             maxRank =  maxRank)
  })
htmlPrint(foo, digits = c(0, 0, 0, 0, 1, 0, 0, 1))
year minLifeExp minCountries minRank maxLifeExp maxCountries maxRank
1952 29 Afghanistan 1.0 73 Norway 142.0
1957 30 Afghanistan 1.0 73 Iceland 142.0
1962 32 Afghanistan 1.0 74 Iceland 142.0
1967 34 Afghanistan 1.0 74 Sweden 142.0
1972 35 Sierra Leone 1.0 75 Sweden 142.0
1977 31 Cambodia 1.0 76 Iceland 142.0
1982 38 Sierra Leone 1.0 77 Japan 142.0
1987 40 Angola 1.0 79 Japan 142.0
1992 24 Rwanda 1.0 79 Japan 142.0
1997 36 Rwanda 1.0 81 Japan 142.0
2002 39 Zambia 1.0 82 Japan 142.0
2007 40 Swaziland 1.0 110 Norway Sweden 141.5

Remember, the 2007 data for Norway and Sweden is fake; useful for testing this code.