Ignore if you don't need this bit of support.
This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd()
command or by glancing at the top of RStudio's Console pane.
Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodically copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "code style".
Load the lattice package and the Gapminder data, then drop Oceania:
library(lattice)
## gDat <- read.delim("gapminderDataFiveYear.txt")
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)
## 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
"Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do." -- Donald E. Knuth
Your goal in your R scripting is not limited to getting the right results, right now. You need to think bigger than that.
It is easy to underappreciate these considerations when you are new to scripting. But, if I have built up any credibility with you over the past several weeks, please TRUST ME when I say that your coding style is very, very important to the quality of your work and your happiness in it.
This has been implicit in everything we have done together and we addressed this briefly very early on when we talked about managing workspaces. Let's be very explicit now:
"The source code is real. The objects are realizations of the source code. Source for EVERY user modified object is placed in a particular directory or directories, for later editing and retrieval." -- from the ESS manual
I like to type commands into the R Console as much as the next person. I do it all the time. The immediacy! The power! This is great for inspection, "thinking out loud" with R, etc. but this is not how you preserve real work. As soon as you've done something of any value, that you might want to repeat, get that code saved into a script ASAP.
How to get code into a script if you've been working in the Console: in RStudio, select the commands worth keeping in the History pane and click "To Source" to send them to a script, or save the entire history with savehistory() and harvest the good lines from the resulting file.
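If you go the savehistory() route, here is a minimal sketch; the file name is just an example:
## dump everything you've typed in the Console to a file for later triage
savehistory(file = "raw-history.R")
## then open raw-history.R, delete the dead ends, and move the keepers into your script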
I lied above about working in the Console. I do that quite rarely actually. Which brings me to my real recommendation about how to work: write your commands in an R script in the editor and send lines or chunks to the Console for execution (in RStudio, Command+Enter or Control+Enter), revising the script as you go.
OK, a bit more than source code is real. What else?
In theory, we would stop here: only input data and source code are real. After all, now you should be able to reproduce everything, right? In theory, yes, but harsh reality has softened my hard edges. Let's get pragmatic.
Here in the real world, you will also want to save and protect a few other things:
The form in which you save these things is another important choice. Short version: default should be plain text, human readable, language/software agnostic. I am not naive enough to think that is always possible. Read more on writing information to file here.
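For example, here is one way to write a data frame to file in that spirit; the file name is just for illustration:
## plain text, human readable, tab-delimited, software agnostic
write.table(jDat, file = "gapminder-no-oceania.tsv",
            quote = FALSE, sep = "\t", row.names = FALSE)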
Reading code is inherently miserable. Especially if it's someone else's code. Note that your own code feels foreign after a sufficient amount of time. Do everyone a favor and adopt a coding style. Here are some good documents to read and many go beyond simple formatting to address coding practice more generally:
Key principles in code formatting, according to JB:
surround binary operators with spaces: a <= b, not a<=b; the = used for argument assignment is an example of such a binary operator: this = that, not this=that
put a space after a comma (and around ~ in a formula): xyplot(y ~ x, myDat), not xyplot(y~x,myDat)
use hard line breaks to keep lines 80 characters or shorter; if this breaks up a function call, use indenting to visually indicate the continuation; example:
## drop some observations and unused factor levels
lotrDat <-
  droplevels(subset(lotrDat,
                    !(Race %in% c("Gollum", "Ent", "Dead", "Nazgul"))))
use indenting when you are, e.g. inside a block delimited by curly braces; example:
jFun <- function(x) {
  estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
  names(estCoefs) <- c("intercept", "slope")
  return(estCoefs)
}
develop a convention for comments: e.g. use ##
to begin any full lines of comments and indent them as if they were code; use #
plus a specific horizontal position to append a comment to a line containing code; example:
## here is a pure comment line
myName <- "jenny" # this is, in fact, my name
myDog <- "buzzy" # I used to have a dog
If you use an R-aware text editor or IDE, such as RStudio or Emacs Speaks Statistics, then much of the above is automatic or extremely easy.
Structuring a script:
Always load packages at the beginning. If it's fairly self-evident why the package is needed, at least to me, I just load it and move on. If it's a specialty or convenience package, then I remind myself what function(s) I'm going to use; this way, if I don't have the package down the road, I can get a sense of how hard it's going to be to fix up and run the code.
library(car) # recode()
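Putting that together, a minimal sketch of a script preamble might look like this; the particular packages are just for illustration, drawn from ones used elsewhere in this tutorial:
library(lattice)       # xyplot(), densityplot()
library(RColorBrewer)  # brewer.pal()
library(car)           # recode()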
Consider periodically rendering your script via knitr and "Compile Notebook". Your mileage may vary. (Do I have anything more to say here that's not obvious or said better elsewhere by someone else?)
I basically present principles verbatim from a conference report "Good Programming Practices in Healthcare: Creating Robust Programs". Shockingly, this document is mostly about SAS (!) but most of their rules are great and apply broadly. Here are my favorites from pages 3 - 4:
subset(myDat, subset = myFactor == someLevel), or the subset = argument more generally, is a great example of this.
The next topic is not written yet, but I want a placeholder and some notes here:
pick a convention for word demarcation in names for files, identifiers, and functions
dates: default to YYYY-MM-DD
examples: 2013-10-15_classList-filtered.txt or block05_getNumbersOut.rmd; notice how easy it is for me to use regular expressions to split such (file)names into the date/number part and the "what is it" descriptive part, by splitting on the underscore _. That's no accident! Thanks Andy Roth for teaching me that trick!
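A quick sketch of that trick; the use of today's date via Sys.Date() is just for illustration:
## build a file name that leads with today's date in YYYY-MM-DD form
jFileName <- paste0(format(Sys.Date()), "_classList-filtered.txt")
## split such names on the underscore to recover the date/number part
## and the descriptive part
strsplit(c(jFileName, "block05_getNumbersOut.rmd"), split = "_")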
use sprintf() literally or as inspiration to pad numbers with leading zeros, so lexical and numeric order coincide
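For example:
## pad to a common width with leading zeros; "02" now sorts before "10"
sprintf("%02d", c(1, 2, 10, 11))
## [1] "01" "02" "10" "11"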
Wikipedia definition of Magic Numbers: "unique values with unexplained meaning or multiple occurrences which could (preferably) be replaced with named constants"
Why do we avoid Magic Numbers in programming? To make code more transparent, easier to maintain, more robust, more reusable, more self-consistent.
How do we avoid Magic Numbers in programming?
If it's an intrinsic fact about your data, derive it. Example: if you're mapping a factor into colors and you want to use the Dark2
palette from RColorBrewer
, don't just request 5 colors from the palette. Instead, use nlevels()
to programmatically determine how many levels your factor has and take that many colors. If you later decide to drop a factor level or if you copy and paste this code from one project to another, the second approach is vastly superior.
library(RColorBrewer)  # brewer.pal()
myColors <- brewer.pal(nlevels(myFactor), name = 'Dark2')
If it's a semi-arbitrary choice you're making, make your choice in exactly one, very obvious place, transparently, give it an informative name, and use the resulting object(s) after that. Example: if you're generating some fake data to demonstrate something, don't hard wire a sample size of 15. Instead, make the assignment n <- 15
(the use of n
counts as an informative name, since the association between n
and sample size in statistics is so strong). And from that point on, use n
, e.g. to compute standard errors, form text strings for labelling figures, etc.
n <- 15
x <- rnorm(n)
densityplot(~ x, main = paste("n =", n))
(seMean <- sd(x)/sqrt(n))
set.seed(): call it before generating random numbers, as in the example above, so your results are reproducible. (Write more here.)
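A minimal sketch of why this matters; the seed value itself is arbitrary:
## fixing the seed makes the simulated draws repeatable
set.seed(4561)
x1 <- rnorm(5)
set.seed(4561)
x2 <- rnorm(5)
identical(x1, x2) # TRUE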
Which piece of code and figure would you like to decipher at 3 a.m.? Both are very basic and the figures rather ugly, but only one makes you angry at the person who wrote it. Go ahead and type the extra 158 characters of code ... I'll wait.
## left-hand figure; code contains 44 characters
xyplot(gDat[427:568,5]~log(gDat[427:568,6]))
## right-hand figure; code contains 202 characters
jYear <- 1967
xyplot(lifeExp ~ gdpPercap, gDat,
       subset = year == jYear, main = paste("year =", jYear),
       group = continent, auto.key = TRUE,
       scales = list(x = list(log = 10, equispaced.log = FALSE)))
Big picture: devote a directory on your computer to a conceptual project. Make it literally an RStudio Project, so it's easy to stop and start analyses, with sane handling of the R process's working directory. Give that directory a good name! Eventually, you will probably also want to make this directory a repository under version control, e.g. a Git repository.
Within the project, once you have more than ~10 files, create subdirectories. JB generally has (in rough order of creation and file population):
Note: once you adopt a subdirectory organization, you'll need to revisit any code that reads or writes files and edit the paths accordingly. For this reason, I often adopt my eventual organizational strategy from the very start, when I know I'll need it.
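For example, if the raw data file moves into a data subdirectory (a hypothetical layout), the import command from the top of this tutorial becomes something like:
## path now reflects the data subdirectory
gDat <- read.delim(file.path("data", "gapminderDataFiveYear.txt"))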
knitr
is fairly stubborn about the working directory being that in which the file you're compiling lives. This can be controlled / worked around, but just don't be surprised that this is a point of some friction. You can expect to pay extra special attention to this if you are generating figures and/or caching.
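One workaround, assuming a reasonably current version of knitr, is to set the root.dir option in a setup chunk; the particular path here is just an example:
library(knitr)
## evaluate chunks relative to the project root rather than the file's directory
opts_knit$set(root.dir = "..")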
Let's delve a little deeper into the code
directory and talk about breaking the analysis into steps and, therefore, different scripts. At a minimum, I have these phases, each of which will be embodied in one -- or often many more -- files:
You notice that I am creating figures all the time, although eventually in a project I might have a few scripts that are solely devoted to figure-making. In a real-world analysis, each of the phases above will have several associated scripts. For example, I enacted the data cleaning and preparation for the Gapminder data over the course of five separate scripts.
For now, use my Keynote slides or the live Finder to tour a few projects.
Good reads on other people's ideas about how to organize an analytical project: