Be the boss of your factors

JB: Still under development. Cannot decide if this is a tutorial, with a story, and fitting into a larger sequence of tutorials or is a topic/reference sort of thing.

Optional getting started advice

Ignore if you don't need this bit of support.

This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd() command or by glancing at the top of RStudio's Console pane.

Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "factor hygiene".

Load the Gapminder data and lattice and plyr

Assuming the data can be found in the current working directory, this works:

gDat <- read.delim("gapminderDataFiveYear.txt")

Plan B (I use here, because of where the source of this tutorial lives):

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)

Basic sanity check that the import has gone well:

str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Load the lattice and plyr packages.

library(lattice)
library(plyr)

Factors are high maintenance variables

Factors are used to hold categorical variables in R. In Gapminder, the examples are continent and country, with year being a numeric variable that can be gracefully converted to a factor depending on context. Under the hood, factors are stored as integer codes, e.g., 1, 2, ... These integer codes are associated with a vector of character strings, the levels, which is what you will see much more often in the Console and in figures. If you read the documentation on factor() you'll see confusing crazy talk about another thing, labels =, which I've never understood and completely ignore.

##     levels
## 1   Africa
## 2 Americas
## 3     Asia
## 4   Europe
## 5  Oceania

But don't ever, ever forget that factors are fundamentally numeric. Apply the functions class(), mode(), and typeof() to a factor if you need convincing.

##         factor
## class   factor
## mode   numeric
## typeof integer

Jenny's Law says that some great huge factor misunderstanding will eat up hours of valuable time in any given analysis. When you're beating your head against the wall, look very closely at your factors. Are you using them in a character context? Do you have factors you did not know about, e.g., variables you perceived as character but that are actually factors? str() is a mighty weapon.

Factor avoidance: is this strategy for you?

Many people have gotten so frustrated with factors that they refuse to use them. While I'm sympathetic, it's a counterproductive overreaction. Factors are sufficiently powerful in data aggregation, modelling, and visualization to merit their use. Furthermore, factor-abstainers are a fringe minority in the R world which makes it harder to share and remix code from others.

http://stackoverflow.com/questions/3445316/factors-in-r-more-than-an-annoyance

Factor avoidance: how to achieve, selectively

Most unwanted factors are created upon import with read.table(), which converts character variables to factors by default. The same default behavior also happens when you construct a data.frame explicitly with data.frame().

Here are options to take control in a more refined fashion:

Convert an existing factor to a character variable with as.character().

How to make, study, and unmake a factor

If you want to create a factor explicitly, use factor(). If you already have a reason and the knowledge to override the default alphabetical ordering of the factor levels, seize this opportunity to set them via the levels = argument.

To check the levels or count them, use levels() and nlevels().

To tabulate a factor, use table(). Sometimes it's nice to post-process this table with as.data.frame, prop.table(), or addmargins().

To get the underlying integer codes, use unclass(). Use this with great care.

Specialty functions for making factors

gl(), interaction(), lattice::make.groups()

Obviously not written yet.

How to change factor levels: recoding

recode() from car package

http://citizen-statistician.org/2012/10/20/recoding-variables-in-r-pedagogic-considerations/

Obviously not written yet.

How to change factor levels: dropping a level

## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))

There's more to say here but droplevels() is a great start.

How to change factor levels: reordering

It's common and desirable to reorder factor levels rationally, as opposed to alphabetically. In fact, it should be more common! Typical scenario: order the levels of a factor so that a summary statistic of an associated quantitative variable is placed in rank order in data aggregation results or in a figure. This is much easier to see in examples. I'll walk through two and, far below, discuss the reorder() function itself.

Example from the data aggregation tutorial. We fit a linear model of lifeExp ~ year for each country and packaged the estimated intercepts and slopes, along with country and continent factors, in a data.frame. It is sensible to reorder the country factor labels in this derived data.frame according to either the intercept or slope. I chose the intercept.

yearMin <- min(gDat$year)
jFun <- function(x) {
  estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
  names(estCoefs) <- c("intercept", "slope")
  return(estCoefs)
  }
jCoefs <- ddply(gDat, ~ country + continent, jFun)
head(levels(jCoefs$country)) # alphabetical order
## [1] "Afghanistan" "Albania"     "Algeria"     "Angola"      "Argentina"  
## [6] "Australia"
jCoefs <- within(jCoefs, country <- reorder(country, intercept))
head(levels(jCoefs$country)) # in increasing order of 1952 life expectancy
## [1] "Gambia"        "Afghanistan"   "Yemen, Rep."   "Sierra Leone" 
## [5] "Guinea"        "Guinea-Bissau"
head(jCoefs)
##       country continent intercept  slope
## 1 Afghanistan      Asia     29.91 0.2753
## 2     Albania    Europe     59.23 0.3347
## 3     Algeria    Africa     43.37 0.5693
## 4      Angola    Africa     32.13 0.2093
## 5   Argentina  Americas     62.69 0.2317
## 6   Australia   Oceania     68.40 0.2277

Note that the row order of jCoefs is not changed by reordering the factor levels. I could choose to reorder the rows of the data.frame, based on the new, rational level order of country. I do below using arrange() from plyr, which is a nice wrapper around the built-in function order().

# assign the arrange() result back to jCoefs to make the new row order "stick"
head(arrange(jCoefs, country)) 
##         country continent intercept  slope
## 1        Gambia    Africa     28.40 0.5818
## 2   Afghanistan      Asia     29.91 0.2753
## 3   Yemen, Rep.      Asia     30.13 0.6055
## 4  Sierra Leone    Africa     30.88 0.2140
## 5        Guinea    Africa     31.56 0.4248
## 6 Guinea-Bissau    Africa     31.74 0.2718
tail(arrange(jCoefs, country))
##         country continent intercept  slope
## 137 Switzerland    Europe     69.45 0.2222
## 138     Denmark    Europe     71.03 0.1213
## 139      Sweden    Europe     71.61 0.1663
## 140 Netherlands    Europe     71.89 0.1367
## 141     Iceland    Europe     71.96 0.1654
## 142      Norway    Europe     72.21 0.1319

Example that reorders a factor temporarily "on the fly". This happens most often within a plotting call. Remember: you don't have to alter the factor's level order in the underlying data.frame itself. Building on the first example, let's make a stripplot of the intercepts, split out by continent, and overlay a connect-the-dots representation of the continent-specific averages. See figure on the left below. The erratic behavior of the overlay suggests that the continents are presented in a silly order, namely, alphabetical. It might make more sense to arrange the continents in increasing order of 1952 life expectancy. See figure on the right below.

Sorry folks, experimenting with figure placement! Please tolerate any lack of side-by-side-ness and duplication for the moment.

stripplot(intercept ~ continent, jCoefs, type = c("p", "a"))
stripplot(intercept ~ reorder(continent, intercept), jCoefs, type = c("p", "a"))

stripplot(intercept ~ continent, jCoefs, type = c("p", "a"))
stripplot(intercept ~ reorder(continent, intercept), jCoefs, type = c("p", "a"))

You try: make a similar pair of plots for the slopes.

You will notice that the continents will be in a very different order, if you reorder by intercept vs. slope. There is no single definitive order for factor levels. It varies, depending on the quantitative variable and statistic driving the reordering. In a real data analysis, when you're producing a large number of related plots, I suggest you pick one factor level order and stick to it. In Gapminder, I would use this: Africa, Asia, America, Europe (and I'd drop Oceania). Sometimes a global commitment to constancy -- in factor level order and color usage especially -- must override optimization of any particular figure.

reorder(x, X, FUN, ..., order) is the main built-in function for reordering the levels of a factor x such that the summary statistics produced by applying FUN to the values of X -- when split out by the reordered factor x -- are in increasing order.FUN defaults to mean(), which is often just fine.

Mildly interesting: post-reordering, the factor x will have a scores attribute containing those summary statistics.

Footnote based on hard personal experience: The package gdata includes an explicit factor method for the reorder() generic, namely reorder.factor(). Unfortunately it is not "upwards compatible" with the built-in reorder.default() in stats. Specifically, gdata's reorder.factor() has no default for FUN so, if called without an explicit value for FUN, nothing happens. No reordering. I have found it is rather easy for the package or at least this function to creep onto my search path without my knowledge, since quite a few packages require gdata or use functions from it. This underscores the importance of checking that reordering has indeed occured. If you're reordering 'on the fly' in a plot, check visually. Otherwise, explicitly inspect the new level order with levels() and/or inspect the scores attribute described above. To look for gdata on your search path, use search(). To force the use of the built-in reorder(), call stats::reorder(). Many have tripped up on this:

How to grow factor objects

Try not to. But rbind()ing data.frames shockingly works better / more often than c()ing vectors. But this is a very delicate operation. WRITE MORE.

This is not as crazy/stupid as you might fear: convert to character, grow, convert back to factor.

Obviously not written yet.

References

Data Manipulation with R by Phil Spector, Springer (2008) | author webpage | GoogleBooks search

Lattice: Multivariate Data Visualization with R available via SpringerLink by Deepayan Sarkar, Springer (2008) | all code from the book | GoogleBooks search