Complete the exercises from the first two weeks. You should be comfortable with starting R, basic syntax and data types, subsetting matrices, and reading and writing data files, as well as basic string manipulation.
Download the sample data file (AIDS data). Ensure you can read it in using read.csv. It is used in the examples below.
The second example data set, which will be used in class, is section3.example.madata.zip, a moderately-sized microarray data set. Make sure you can load it with 'read.table' or 'read.delim' (unzip it first!). It has one row of headings (with sample names) and the row names are in the first column. It has 10,000 rows and 59 columns, so it's reasonably realistic. There is also a small metadata file.
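As a rough sketch of the loading pattern described above, the following writes a tiny stand-in file and reads it back. The file contents, probe IDs and sample names are made up for illustration; for the real data, replace the temporary file with the path to the unzipped file.

```r
# Write a tiny stand-in file: one header row of sample names,
# with row names (made-up probe IDs) in the first column.
tmp <- tempfile(fileext = ".txt")
writeLines(c("probe\tsampleA\tsampleB\tsampleC",
             "p1\t7.1\t6.8\t7.4",
             "p2\t3.2\t3.9\t3.5"), tmp)

# header = TRUE is the default for read.delim; row.names = 1 takes
# the first column as row names rather than as data.
ma <- read.delim(tmp, row.names = 1)

dim(ma)   # rows x columns, handy for checking against the expected size
head(ma)
```

Checking dim() straight after loading is a quick way to confirm that the headings and row names were picked up as intended.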
Install the following packages from CRAN: RColorBrewer, gplots, sciplot.
Create the average density from last week's exercise, also taking logs of the DNase concentration and optical density:
DNase.log <- DNase
DNase.log$conc <- log(DNase$conc)
DNase.log$density <- log(DNase$density)
dens.avg.frame <- aggregate(DNase.log$density, list(DNase.log$conc), mean)
names(dens.avg.frame) <- c("dnase.conc", "dens.avg")
Now try:
plot(dens.avg.frame)
Let's be clear about what's on this graph:
plot(dens.avg.frame, xlab = "log(DNase concentration)", ylab = "log(Average density)")
And give it an appropriate title:
plot(dens.avg.frame, xlab = "log(DNase concentration)", ylab = "log(Average density)",
main = "Average optical density of DNA solution treated with DNase at varying concentrations")
plot() creates scatterplots by default, but is a "generic" command that can do a variety of things. Controlling plotting is done (in large part) by using settings documented in the "par" command.
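A small sketch of both ideas, rebuilding the averaged data frame so that the block runs on its own. The particular par settings (margins, label size, label orientation) are arbitrary choices for illustration.

```r
# Self-contained copy of the averaged data frame built above.
DNase.log <- data.frame(conc = log(DNase$conc), density = log(DNase$density))
dens.avg.frame <- aggregate(DNase.log$density, list(DNase.log$conc), mean)
names(dens.avg.frame) <- c("dnase.conc", "dens.avg")

# par() returns the previous values of any settings it changes,
# so they can be restored afterwards.
old.par <- par(mar = c(5, 5, 2, 1), cex.lab = 1.2, las = 1)
plot(dens.avg.frame)
par(old.par)

# plot() is generic: what it draws depends on the class of its argument.
plot(DNase$Run)  # a factor gives a bar chart of counts, not a scatterplot
```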
You can make different types of plots, e.g. points, lines or both:
plot(dens.avg.frame, type = "p")
plot(dens.avg.frame, type = "l")
plot(dens.avg.frame, type = "b")
You can control the character for points using pch. You can use text characters, or the built-in plot characters (numbered 0-25). Try these:
plot(dens.avg.frame, type = "b", pch = 3)
plot(dens.avg.frame, type = "b", pch = "R")
You can also control the line type using lty (from 1-6):
plot(dens.avg.frame, type = "b", lty = 3)
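If it helps to see the options side by side, the following sketch draws a quick reference chart of the built-in point characters and line types. The layout of the chart (positions, text size) is an arbitrary choice.

```r
# Draw each built-in plotting character (0 to 25) with its number underneath.
pch.codes <- 0:25
plot(pch.codes, rep(2, length(pch.codes)), pch = pch.codes,
     ylim = c(0, 2.5), xlab = "pch value", ylab = "", yaxt = "n")
text(pch.codes, 1.8, labels = pch.codes, cex = 0.7)

# Draw the six standard line types as horizontal segments, labelled on the right.
lty.codes <- 1:6
segments(x0 = 0, y0 = lty.codes / 5, x1 = 20, y1 = lty.codes / 5,
         lty = lty.codes)
text(22, lty.codes / 5, labels = paste("lty", lty.codes), cex = 0.7)
```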
Exercise 1: Using the AIDS data you loaded from the CSV file, plot a line and point plot of AIDS cases in New York ('New York, NY') by 'Year'. Use diagonal crosses (like an 'x'), and a broken line. Give your plot appropriate axis labels and a title. Refer to the par documentation as necessary.
Let's take just the DNase readings for DNase concentration=6.25, and plot them:
DNase.625 <- DNase[which(DNase$conc == 6.25), ]
barplot(DNase.625$density, names.arg = DNase.625$Run)
How about combining the duplicates, and adding error bars using sciplot:
library(sciplot)
bargraph.CI(DNase.625$Run, DNase.625$density, err.width = 0.05)
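If sciplot is not available, a similar figure can be sketched in base R. This is a substitute technique, not what bargraph.CI does internally, and it assumes the error bars should show the mean plus or minus one standard error of the duplicate readings in each run.

```r
# Readings at concentration 6.25, as in the example above.
DNase.625 <- DNase[which(DNase$conc == 6.25), ]

# Mean and standard error of the duplicate readings within each run.
run.mean <- tapply(DNase.625$density, DNase.625$Run, mean)
run.se   <- tapply(DNase.625$density, DNase.625$Run,
                   function(x) sd(x) / sqrt(length(x)))

# barplot() returns the x position of each bar's midpoint.
mids <- barplot(run.mean, ylim = c(0, max(run.mean + run.se) * 1.1),
                xlab = "Run", ylab = "Optical density")

# Flat-headed arrows (angle = 90, code = 3) act as error bars.
arrows(mids, run.mean - run.se, mids, run.mean + run.se,
       angle = 90, code = 3, length = 0.05)
```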
Exercise 2:
- Take the AIDS statistics for 1994. Plot a barplot of all locations, without X axis labels, in decreasing order by incidence.
- Then find the 5 locations with the highest incidence for that year, and plot a bar chart of those, with axis labels.
- Label your X and Y axes, and add a title to each graph.
Very commonly you will want to look at the distribution of some data. Dr. Bryan showed several examples in her first lecture.
Histograms are a basic way of looking at the distribution of values:
hist(DNase.625$density)
Use the density command to make a "smoothed" curve. This is useful when you want to view several distributions simultaneously, since the curves don't "overlap" as badly as histograms.
plot(density(DNase.625$density))
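As a sketch of viewing several distributions at once, the following overlays the density curves for two of the concentrations in DNase. The choice of concentrations and line types is arbitrary.

```r
# Optical density readings at two of the concentrations in DNase.
d.low  <- density(DNase$density[DNase$conc == 1.5625])
d.high <- density(DNase$density[DNase$conc == 6.25])

# Set limits wide enough for both curves, then overlay the second one.
plot(d.low, xlim = range(d.low$x, d.high$x), ylim = range(d.low$y, d.high$y),
     main = "Optical density at two DNase concentrations")
lines(d.high, lty = 2)
```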
Exercise 3: What does density do by itself, without the call to plot?
Another valuable tool is boxplot, which we leave as an exercise.
Exercise 4: Investigate the boxplot command. Make a boxplot that displays the distributions of the "conc" and "density" data from DNase.log. How do you read a box plot?
You can use par() with the arguments mfrow or mfcol to set up a panel of plots:
par(mfrow = c(2, 1))
bargraph.CI(DNase.625$Run, DNase.625$density, err.width = 0.05)
hist(DNase.625$density)
Also see layout() for more advanced placement of multiple graphs.
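A minimal layout() sketch, in which one plot spans the top row and two share the bottom row. The arrangement is an illustrative choice.

```r
DNase.625 <- DNase[which(DNase$conc == 6.25), ]

# The matrix gives the plot number occupying each cell of a 2 x 2 grid:
# plot 1 spans the whole top row, plots 2 and 3 share the bottom row.
cells <- matrix(c(1, 1,
                  2, 3), nrow = 2, byrow = TRUE)
layout(cells)

barplot(DNase.625$density, names.arg = as.character(DNase.625$Run))  # plot 1
hist(DNase.625$density)                                              # plot 2
plot(density(DNase.625$density))                                     # plot 3

layout(1)  # back to a single plot per page
```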
It is common to want to display more than one set of data on the same set of axes.
R offers a command, par(new=T), which stops the display from clearing between successive calls to plotting commands. But you should NOT use this. To see why, try:
# NOTE: Does not do what we want!
plot(dens.avg.frame, type = "b", pch = 3)
par(new = T)
plot(DNase.log[which(DNase.log$Run == 4), c("conc", "density")], pch = 7)
Instead, call plot only once and then add data to the graph. points and lines allow you to add points or lines to an existing plot:
plot(dens.avg.frame, type = "b", pch = 3)
points(DNase.log[which(DNase.log$Run == 4), c("conc", "density")], pch = 7)
plot(dens.avg.frame, type = "b", pch = 3)
lines(DNase.log[which(DNase.log$Run == 4), c("conc", "density")], pch = 7)
Sometimes a good strategy is to make an empty set of axes (use type='n'), and then add data to it. Here we also demonstrate control of axis limits.
plot(0, xlab = "DNase concentration", ylab = "Average density", type = "n",
xlim = c(-3, 3), ylim = c(-3, 1))
points(dens.avg.frame, type = "b", pch = 3)
lines(DNase.log[which(DNase.log$Run == 4), c("conc", "density")], pch = 7, type = "p")
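Rather than hard-coding the limits, they can be derived from the data. The sketch below does this with range() and adds a legend. The legend text and its position are illustrative choices, and the data frames are rebuilt so that the block runs on its own.

```r
DNase.log <- data.frame(Run = DNase$Run, conc = log(DNase$conc),
                        density = log(DNase$density))
dens.avg.frame <- aggregate(DNase.log$density, list(DNase.log$conc), mean)
names(dens.avg.frame) <- c("dnase.conc", "dens.avg")
run4 <- DNase.log[which(DNase.log$Run == 4), c("conc", "density")]

# Axis limits that cover both data sets.
x.lim <- range(dens.avg.frame$dnase.conc, run4$conc)
y.lim <- range(dens.avg.frame$dens.avg, run4$density)

plot(0, type = "n", xlim = x.lim, ylim = y.lim,
     xlab = "log(DNase concentration)", ylab = "log(optical density)")
points(dens.avg.frame, type = "b", pch = 3)
points(run4, pch = 7)
legend("topleft", legend = c("Average of all runs", "Run 4"), pch = c(3, 7))
```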
Exercise 5: Experiment with changing some of the settings in this last example, such as axis limits, colours, and line thickness. You'll probably need to look at the documentation for par.
More advanced: If you are building many plots together, you can use apply. Using the "empty axis" method is pretty much essential. Here we plot densities of both of our data columns at once.
plot(0, type = "n", xlim = c(-5, 5), ylim = c(0, 0.5))
apply(DNase.log[, 2:3], 2, function(x) lines(density(x)))
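One possible variation on the same idea computes the density estimates first, so the axis limits can be taken from them, and then uses a for loop so that each curve gets its own line type. The axis labels and legend position are illustrative choices.

```r
DNase.log <- data.frame(conc = log(DNase$conc), density = log(DNase$density))

# One density estimate per column, computed first so limits can be derived.
dens.list <- lapply(DNase.log, density)
x.lim <- range(sapply(dens.list, function(d) range(d$x)))
y.lim <- range(sapply(dens.list, function(d) range(d$y)))

plot(0, type = "n", xlim = x.lim, ylim = y.lim,
     xlab = "log value", ylab = "Density")

# A for loop avoids the stray NULL that apply() returns here,
# and makes it easy to vary the line type per curve.
for (i in seq_along(dens.list)) {
  lines(dens.list[[i]], lty = i)
}
legend("topleft", legend = names(dens.list), lty = seq_along(dens.list))
```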
R offers the "heatmap" command to make "false colour" images of data. You may want to use heatmap.2 from the gplots library (see Additional graphing libraries below).
We'll learn more about heatmaps in class. For now, just look at the documentation for heatmap and try some of the examples they provide there.
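For a first look, here is a short sketch with base heatmap() on the built-in mtcars data set. The data set is chosen only because it ships with R, and the colour palette is an arbitrary choice.

```r
# mtcars is a built-in data set; heatmap() needs a numeric matrix.
m <- as.matrix(mtcars)

# scale = "column" standardises each variable so that columns measured in
# different units share one colour scale. Rows and columns are reordered
# by hierarchical clustering, shown as dendrograms in the margins.
heatmap(m, scale = "column", col = heat.colors(16))
```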
To save a plot to a file, in general: open a device, plot, then call dev.off().
Do not forget dev.off()! R will not finish saving the file until you call it!
We recommend using PNG output for convenient viewing and sharing of "draft" figures. Generally when you need high-quality output you will use a vector-based format (see below).
png("avg_density.png", width = 800, height = 800)
plot(dens.avg.frame, type = "b", lty = 3)
dev.off()
Default width/height is 480 x 480. Blocky, but OK for quick viewing and putting on web pages.
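When the default looks too blocky, the size can be given in inches together with a resolution. The following is a sketch; the file name is hypothetical and 150 pixels per inch is an arbitrary choice.

```r
out.file <- file.path(tempdir(), "avg_density_hires.png")  # hypothetical name

# 6 x 6 inches at 150 pixels per inch gives a 900 x 900 pixel image.
png(out.file, width = 6, height = 6, units = "in", res = 150)
plot(log(DNase$conc), log(DNase$density),
     xlab = "log(DNase concentration)", ylab = "log(optical density)")
dev.off()

file.exists(out.file)
```

Setting res as well as the size keeps text and symbols in proportion as the pixel dimensions grow.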
bmp(), tiff() and jpeg() work similarly. However, PNG is best - compressed but not lossy. Note that many (most?) journals will not accept PNG for figures. TIFF is common but generates big files.
Vector formats such as PDF and PostScript are resolution-independent. This means that no matter how much you "blow up" the image, it will always look crisp. Since most graphs are really vector graphics (as opposed to being like photographs), you should take advantage of this to make your images look good in print.
pdf("avg_density.pdf")
plot(dens.avg.frame, type = "b", lty = 3)
dev.off()
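A PDF device can also hold several plots, one per page, and its size is set in inches. The following is a sketch with a hypothetical file name and an arbitrary page size.

```r
out.file <- file.path(tempdir(), "density_plots.pdf")  # hypothetical name

# Width and height are in inches for pdf(); the default is 7 x 7.
pdf(out.file, width = 5, height = 4)
hist(DNase$density)            # page 1
plot(density(DNase$density))   # page 2
dev.off()

file.exists(out.file)
```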
Use postscript to do PS/EPS.
In most cases use postscript or PDF (vector means higher quality). Postscript is the format most journals like to get for any kind of line art. PDF is basically a variation on postscript, but journals don't like it for final products.
For very complex plots, a high-resolution PNG may be a good choice. A plot with a million points will end up being a very big postscript file that takes a long time to display. On the other hand, plotting a million points probably isn't what you want to do anyway: consider a data reduction step.
svg() may also be good if you want to edit your graph further (for instance in Inkscape).
Trying to make your graph figure-quality perfect in R can be tricky; it can be easier to do some touch-up in a program like Inkscape or Adobe Illustrator. If you are working with postscript on Windows or MacOS, Adobe Illustrator is an excellent (but non-free) choice.
Exercise 6: Make sure you can view PNG and PDF files you create from R. For PNGs, you can use your web browser or any graphics program.
These add-ons to R can be very powerful, but have their own learning curves.
Creates grids of plots from a single data set, where the data is subsetted based on variables you specify.
Newer, allows finer control, emphasises information visualization best practices.
Most commands are similar to standard R methods, but with a lot of extras.
For making heatmaps, gplots offers heatmap.2, which is better than the built-in R heatmap() because it makes a scale bar. Unfortunately it also does other stuff by default, which needs to be switched off (trace="none", dendrogram="none").
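A sketch of heatmap.2 with those defaults switched off, using the built-in mtcars data purely for illustration. It assumes gplots is installed and skips the plot otherwise. Setting Rowv and Colv to FALSE is an extra choice here that also disables the reordering of rows and columns.

```r
m <- scale(as.matrix(mtcars))  # built-in data, standardised by column

if (requireNamespace("gplots", quietly = TRUE)) {
  # trace = "none" removes the trace lines drawn over the cells;
  # dendrogram = "none" together with Rowv/Colv = FALSE switches off
  # both the dendrograms and the clustering-based reordering.
  gplots::heatmap.2(m, trace = "none", dendrogram = "none",
                    Rowv = FALSE, Colv = FALSE, margins = c(5, 8))
} else {
  message("gplots is not installed; see install.packages(\"gplots\")")
}
```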
Other useful graphing libraries include
hexbin, to make scatter plots useful when there are many points.
ggplot2, which has many sophisticated ways of making graphs.
For interactive graphics, you can try rggobi, which is an R interface to GGobi. Installing it on Windows can be finicky.
Paul may show some examples in lecture.
R Graphics, Paul Murrell - available online if within UBC network
R Graph Gallery - huge gallery of pretty plots, with source code!
Controlling Layouts - Good explanation how to lay out figures.