Homework #6 The End: Putting it all together

Homework #6 The End: Putting it all together

Follow the existing homework submission instructions as much as possible, e.g. use systematic, informative filenames. By 12pm Monday October 21, be prepared to share a link via a Google doc. I will announce sharing on the Google Group as usual.

Big picture

Write at least two R scripts to enact an analysis
Output of the first script must be (an) input of the second (and so on ...)
Write numerical data to file; file must be readable by a human and by someone who doesn't use R
Write figure(s) to file
Provide evidence of the following:
- Initial state of directory, containing data and R scripts
- How you run your 2 (or more) scripts non-interactively
- End state of the directory, containing newly written numerical and graphical output
Provide the files themselves and a description of what to do with them

Templates

Bare bones example using only R: hw06_scaffolds/01_justR
Example upgraded to include using Make to run the pipeline: hw06_scaffolds/02_rAndMake
Example compiling an R script and an R Markdown file to HTML via the makefile: hw06_scaffolds/03_knitWithoutRStudio

Please just tell me what to do

If you don't feel like dreaming up your own thing, here's a Gapminder blueprint that is a minimal but respectable way to complete the assignment. You are welcome to remix R code already written by someone, student or JB, in this class, but credit/link appropriately, e.g. in comments.

JB has provided an template, using a different dataset, here, that should help make this concrete.

Write a suitably named R script:
- Import the data.
- Save a couple descriptive plots to file with highly informative names.
- Reorder the continents based on life expectancy. You decide the details.
- Sort the actual data in a deliberate fashion. You decide the details, but this should at least implement your new continent ordering.
- Write the Gapminder data to file(s), for immediate and future reuse.
Write a second suitably named R script:
- Import the data created in the first script.
- Make sure your new continent order is still in force. You decide the details.
- Fit a linear regression of life expectancy on year within each country. Write the estimated intercepts, slopes, and residual error variance (or sd) to file.
- Find the 3 or 4 "worst" and "best" countries for each continent. You decide the details.
- Write the linear regression info for just these countries to file.
- Create a single-page figure for each continent, including data only for the 6-8 "extreme" countries, and write to file. One file per continent, with an informative name. The figure should give scatterplots of life expectancy vs. year, panelling/facetting on country, fitted line overlaid.
Identify and test a method of running your pipeline non-interactively.
- You could write a master R script that simply source()s the two scripts, one after the other. Tip: you will probably want a second "clean up / reset" script that deletes all the output your scripts leave behind, so you can easily test and refine your strategy, i.e. without repeatedly deletng stuff "by hand". You can run the master script or the cleaning script from a shell with R CMD BATCH or Rscript.
Dear Windows people: Recent versions of Windws have PowerShell, whose syntax resembles the Unix-type shells the Mac/Linux folks will have automatically. Try using that. See below for links to documentation on PowerShell and for help getting other software.
Provide a link to a page that explains how your pipeline works and links to the remaining files. Jenny and Song should be able to go to this landing page and re-run your analysis quickly and easily.
- Minimum: I think you can accomplish this with RPubs and Gist, with a carefully constructed README-type file that links to the other files.
- Alternatively, you could put this all in a directory on the web somewhere and submit a link to file that serves as landing page. In that case, it is a very nice touch to offer the files in a browsable form AND as an easily downloaded archive.

I want to aim higher!

Follow the basic Gapminder blueprint above, but find a different data aggregation task, different panelling/facetting emphasis, focus on different variables, etc.

Use non-Gapminder data.

This means you'll need to spend more time on data cleaning and sanity checking. You will probably have an entire script (or more!) devoted to data prep. Examples:
- Are there wonky factors? Consider bringing in as character, addressing their deficiencies, then converting to factor.
- Are there variables you're just willing to drop at this point? Do it!
- Are there dates and times that need special handling? Do it!
- Are there annoying observations that require very special handling or crap up your figures (e.g. Oceania)? Drop them!

Include some dynamic report generation in your pipeline. That is, create HTML from one or more plain R or R markdown files.

Example of how to emulate RStudio's "Compile Notebook" button from a shell:
- Rscript -e "knitr::stitch_rmd('myAwesomeScript.R')"
To emulate "Knit HTML", do something like the above but with knitr::knit2html().
See the Makefile in this example to see these commands in action

Experiment with running R code saved in a script from within R Markdown. Here's some official documentation on code externalization.

Embed pre-existing figures in and R Markdown docuement, i.e. an R script creates the figures, then the report incorporates them. General advice on writing figures to file is here and ggplot2 has a purpose-built function ggsave() you should try. See an example of this in an R Markdown file in one of the examples.

Import pre-existing data in an R Markdown document, then format nicely as a table.

Use Pandoc and/or LaTeX to explore new territory in document compilation. You could use Pandoc as an alternative to knitr for Markdown to HTML conversion; you'd still use knitr for conversion of R Markdown to Markdown. You would use LaTeX to get PDF output from R Markdown.

Start using Git! Create a new Git repository for all the files for this assignment. See below for help.

Start using Github! Share your repo with us on the web! See below for help.

Use Make to run your pipeline. See below for help. Also demonstrated in the example hw06_scaffolds/02_rAndMake and in the example hw06_scaffolds/03_knitWithoutRStudio

How to install and use recommended tools

Setup instructions for Software Carpentry R Bootcamp, Canberra, October 9 - 10 2013 <-- has good info on GitBash, a combined Bash shell + git install for Windows
Additional software installation instructions from above SWC Bootcamp
Microsoft info on PowerShell
An introduction to Git/Github by Karl Broman, aimed at stats / data science types
JB highly recommends the free SourceTree Git client. Available for Windows or Mac.
Ram, K 2013. Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine 2013 8:7. Go to the associated github repo to get the PDF (link at bottom of README) and to see a great example of how someone managed the process of writing a manuscript with git(hub).
Blog post Version control for scientific research by Karthik Ram and Titus Brown on the BioMed Central blog February 28
Blog post Getting Started With a GitHub Repository from ProfHacker on chronicle.com looks helpful
Need a chuckle? xkcd on Git commit messages
An introduction to Make by Karl Broman, aimed at stats / data science types
Blog post Using Make for reproducible scientific analyses by Ben Morris
Slides on Make from Software Carpentry
Mike Bostock on Why Use Make
Zachary M. Jones on GNU Make for Reproducible Data Analysis
Ben Bolker RPub Literate programming, version control, reproducible research, collaboration, and all that
A set-up free way to play with Git: Code School - Try Git