R coding style and organizing analytical projects

Optional getting started advice

Ignore if you don't need this bit of support.

This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd() command or by glancing at the top of RStudio's Console pane.

Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodically copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "code style".

Load the Gapminder data and lattice

library(lattice)
## gDat <- read.delim("gapminderDataFiveYear.txt")
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)
## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Deep thoughts

"Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do." -- Donald E. Knuth

Your goal in your R scripting is not limited to getting the right results, right now. You need to think bigger than that.

It is easy to underappreciate these considerations when you are new to scripting. But, if I have built up any credibility with you over the past several weeks, please TRUST ME when I say that your coding style is very, very important to the quality of your work and your happiness in it.

Source is real

This has been implicit in everything we have done together and we addressed this briefly very early on when we talked about managing workspaces. Let's be very explicit now:

"The source code is real. The objects are realizations of the source code. Source for EVERY user modified object is placed in a particular directory or directories, for later editing and retrieval." -- from the ESS manual

I like to type commands into the R Console as much as the next person. I do it all the time. The immediacy! The power! This is great for inspection, "thinking out loud" with R, etc., but this is not how you preserve real work. As soon as you've done something of any value, that you might want to repeat, get that code saved into a script ASAP.

How to get code into a script if you've been working in the Console:

I lied above about working in the Console. Actually, I do that quite rarely. Which brings me to my real recommendation about how to work:
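If valuable commands do end up stranded in the Console, base R can dump your session history into a file for later pruning and editing. A minimal sketch (the filename raw-history.R is my choice, not part of the tutorial; these functions only do something useful in an interactive session):

```r
## only meaningful in an interactive session, e.g. RStudio's Console
if (interactive()) {
  ## review the most recent commands from this session
  history(max.show = 25)

  ## dump the session history to a file you can open, prune,
  ## and promote into a proper script
  savehistory(file = "raw-history.R")
}
```

In RStudio, the History pane's "To Source" button accomplishes much the same thing with less typing.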

What else is real

OK, a bit more than source code is real. What else?

In theory, we would stop here: only input data and source code are real. After all, now you should be able to reproduce everything, right? In theory, yes, but harsh reality has softened my hard edges. Let's get pragmatic.

Here in the real world, you will also want to save and protect a few other things:

The form in which you save these things is another important choice. Short version: default should be plain text, human readable, language/software agnostic. I am not naive enough to think that is always possible. Read more on writing information to file here.
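To make the plain-text default concrete, here is a sketch (the toy data frame and filenames are mine, not from the tutorial):

```r
## a toy data object standing in for real cleaned data
cleanDat <- data.frame(country = c("Canada", "France"),
                       lifeExp = c(81.2, 80.7))

## plain text, human readable, software agnostic: a good default
write.table(cleanDat, "cleanDat.tsv",
            quote = FALSE, sep = "\t", row.names = FALSE)

## an R-specific format like this is compact but opaque to other tools;
## reserve it for objects that genuinely don't flatten to plain text
saveRDS(cleanDat, "cleanDat.rds")
```

The .tsv file can be opened in any editor or spreadsheet a decade from now; the .rds file requires R.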

Coding style

Reading code is inherently miserable, especially if it's someone else's code. Note that your own code feels foreign after enough time has passed. Do everyone a favor and adopt a coding style. Here are some good documents to read; many go beyond simple formatting to address coding practice more generally:

Key principles in code formatting, according to JB:

If you use an R-aware text editor or IDE, such as RStudio or Emacs Speaks Statistics, then much of the above is automatic or extremely easy.

Structuring a script:

General principles

I present principles essentially verbatim from a conference report, "Good Programming Practices in Healthcare: Creating Robust Programs". Shockingly, this document is mostly about SAS (!), but most of its rules are great and apply broadly. Here are my favorites from pages 3-4:

Naming things

Not written yet but I want a placeholder and some notes here.

word demarcation

names for files, identifiers, functions

dates: default to YYYY-MM-DD

2013-10-15_classList-filtered.txt or block05_getNumbersOut.rmd: notice how easy it is for me to use regular expressions to split such (file)names into the date/number part and the "what is it" descriptive part, by splitting on underscore _. That's no accident! Thanks Andy Roth for teaching me that trick!

use sprintf() literally or as inspiration to pad numbers with leading zeros, so lexical and numeric order coincide
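A sketch of the padding idea (the figure filenames are invented for illustration):

```r
## without padding, lexical sort order and numeric order disagree:
## "fig10.png" sorts before "fig5.png"
files <- paste0("fig", c(1, 5, 10, 99), ".png")
sort(files, method = "radix")

## with sprintf()-style zero padding, the two orders coincide
padded <- paste0("fig", sprintf("%02d", c(1, 5, 10, 99)), ".png")
sort(padded, method = "radix")
## [1] "fig01.png" "fig05.png" "fig10.png" "fig99.png"

## YYYY-MM-DD date prefixes sort chronologically for the same reason
format(Sys.Date(), "%Y-%m-%d")
```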

Avoid Magic Numbers

Wikipedia definition of Magic Numbers: "unique values with unexplained meaning or multiple occurrences which could (preferably) be replaced with named constants"

Why do we avoid Magic Numbers in programming? To make code more transparent, easier to maintain, more robust, more reusable, more self-consistent.

How do we avoid Magic Numbers in programming?
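One common tactic, sketched on a toy data frame (the Gapminder analogue would be deriving the number of countries or years from the data rather than hard-coding them):

```r
df <- data.frame(id   = rep(c("a", "b", "c"), each = 4),
                 year = rep(2001:2004, times = 3),
                 x    = rnorm(12))

## fragile: 4 and 3 are magic numbers that silently assume
## the data never changes
## xWide <- matrix(df$x, nrow = 4, ncol = 3)

## robust: derive the constants from the data and give them names
nYears <- length(unique(df$year))
nIds   <- length(unique(df$id))
xWide  <- matrix(df$x, nrow = nYears, ncol = nIds)
dim(xWide)  ## 4 3
```

If a fourth id or a fifth year appears in the data, the named-constant version keeps working; the magic-number version breaks, possibly silently.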

Make things repeatably random

set.seed()

write more
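A minimal demonstration of repeatable randomness (the seed value is arbitrary; what matters is that it's recorded in the script):

```r
set.seed(4561)      ## any integer works; keep it in the script
x1 <- runif(5)

set.seed(4561)      ## same seed, same "random" numbers
x2 <- runif(5)

identical(x1, x2)   ## TRUE: the analysis can be re-run exactly
```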

Use names and aggressively exploit holistic facilities for subsetting, transformation, and labelling

Which piece of code and figure would you like to decipher at 3 a.m.? Both are very basic and the figures rather ugly, but only one makes you angry at the person who wrote it. Go ahead and type the extra 158 characters of code ... I'll wait.

## left-hand figure; code contains 44 characters
xyplot(gDat[427:568,5]~log(gDat[427:568,6]))

## right-hand figure; code contains 202 characters
jYear <- 1967
xyplot(lifeExp ~ gdpPercap, gDat,
       subset = year == jYear, main = paste("year =", jYear),
       group = continent, auto.key = TRUE, 
       scales = list(x = list(log = 10, equispaced.log = FALSE)))

Organization

Big picture: devote a directory on your computer to a conceptual project. Make it literally an RStudio Project, so it's easy to stop and start analyses, with sane handling of the R process's working directory. Give that directory a good name! Eventually, you will probably also want to make this directory a repository under version control, e.g. a Git repository.

Within the project, once you have more than ~10 files, create subdirectories. JB generally has (in rough order of creation and file population):

Note: once you adopt a subdirectory organization, you'll need to revisit any scripts that read or write files and update the paths. For this reason, I often adopt my eventual organizational strategy from the very start, when I know I'll need it.
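When updating those paths, building them with file.path() keeps them portable across operating systems. A sketch assuming the raw data has moved into a data/ subdirectory (the subdirectory name is illustrative):

```r
## before reorganizing: read.delim("gapminderDataFiveYear.txt")
## after moving raw data into a data/ subdirectory:
gapFile <- file.path("data", "gapminderDataFiveYear.txt")
## gDat <- read.delim(gapFile)
gapFile  ## "data/gapminderDataFiveYear.txt" on Unix-alikes
```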

knitr is fairly stubborn about the working directory being that in which the file you're compiling lives. This can be controlled / worked around, but just don't be surprised that this is a point of some friction. You can expect to pay extra special attention to this if you are generating figures and/or caching.
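One way to exert that control, if you choose to (this configuration is my suggestion, not something the tutorial prescribes), is to set knitr's root.dir option in a setup chunk:

```r
## in a setup chunk at the top of the .Rmd, if the file lives in code/
## but all paths in the document are written relative to the project root:
knitr::opts_knit$set(root.dir = "..")
```

Even so, expect to double-check where figure and cache directories land.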

Let's delve a little deeper into the code directory and talk about breaking the analysis into steps and, therefore, different scripts. At a minimum, I have these phases, each of which will be embodied in one -- or often many more -- files:

You'll notice that I am creating figures all the time, although eventually in a project I might have a few scripts solely devoted to figure-making. In a real-world analysis, each of the phases above will have several associated scripts. For example, I carried out the data cleaning and preparation for the Gapminder data over the course of five separate scripts.

For now, use my Keynote slides or the live Finder to tour a few projects.

Good reads on other people's idea about how to organize an analytical project:
