The aim of the study was to "generate gene expression profiles of purified photoreceptors at distinct developmental stages and from different genetic backgrounds". The experimental units were mice and the microarray platform was Affymetrix mouse genomic expression array 430 2.0. For more information on this study, please refer to the 2006 publication: http://www.ncbi.nlm.nih.gov/pubmed/16505381 The data is also directly accessible from GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4051 TO DO: can we track down provenance of the files we start with? And/or can we start over with data from a link above and preserve the entire process? There are two main sources of information. Look for data files in "data/photoRec/". 1. The gene expression data itself. The GSE4051_data_RAW.txt file contains expression values of 29949 probes from photoreceptor cells in 39 mice samples. A "cleaned" version, where the columns = variables have been rearranged rationally, is given in GSE4051_data.txt. See the script 'caseStudies/photoRecPreprocess/02-cleanData.R' for details. 2. The metadata file GSE4051_design_RAW.txt describes the experimental condition for each sample. Gene expression was studied at 5 different developmental stages: day 16 of embryonic development (E16), postnatal days 2,6 and 10 (P2, P6 and p10) as well as 4_weeks. Each of these 5 experimental conditions are tested in wild type Nrl mice and knockout Nrl mice. A "cleaned" version, with variables renamed and factors and row order rationalized, is given in GSE4051_design.txt. The same info is preserved in two R-specific formats in files GSE4051_design_DPUT.txt and GSE4051_design.robj. These files have the advantage that, upon import, e.g. the levels of the developmental stage factor will be in chronological order (not alphabetical). See the script 'caseStudies/photoRecPreprocess/01-cleanDesign.R' for details. There are several derived datasets, created from processing the above. GSE4051_MINI.txt holds data for 3 randomly selected probesets, renamed for fun, transposed into convenient column variables and stored together in a data.frame with the experimental condition information for each sample. See the script `03-createMiniDataset.R` for how it was created. TO DO: simple differential expression analysis TO DO: use diff exp analysis to pick 3 more interesting probesets for a new mini dataset HOW TO READ DATA/DESIGN ------------------------ WARNING: It is your responsibility to make sure the working directory is set to where these files live or to edit paths accordingly below! Raw data and design: prDat <- read.table("GSE4051_data_RAW.txt", sep = "\t", header = T, row.names = 1) prDes <- read.table("GSE4051_design_RAW.txt", sep = "\t", header = T, row.names = 1) After the completion of data import, both "data" and "design" objects should be of class "data.frame". > class(prDat) [1] "data.frame" > str(prDat) 'data.frame': 29949 obs. of 39 variables: $ Sample_35: num 7.15 9.22 10.06 8.35 8.45 ... $ Sample_32: num 7.54 9.53 9.92 8.78 8.57 ... < ... snip, snip ... > $ Sample_3 : num 7.16 9.55 9.84 8.33 8.5 ... $ Sample_14: num 7.09 9.56 9.88 8.57 8.59 ... > class(prDes) [1] "data.frame" > str(prDes) 'data.frame': 39 obs. of 2 variables: $ developmentStage : Factor w/ 5 levels "4_weeks","E16",..: 1 1 1 1 1 1 1 1 2 2 ... $ genotypeOrVariation: Factor w/ 2 levels "Nrl_deficient",..: 1 1 1 1 2 2 2 2 1 1 ... Cleaned data and design (saved in various formats): prDat <- read.table("photoRec/GSE4051_data.txt") str(prDat) ## 'data.frame': 29949 obs. of 39 variables: ## $ Sample_20: num 7.24 9.48 10.01 8.36 8.59 ... ## $ Sample_21: num 7.41 10.02 10.04 8.37 8.62 ... ## ... ## $ Sample_2 : num 7.35 9.66 9.91 8.4 8.37 ... ## $ Sample_9 : num 7.32 9.8 9.85 8.4 8.46 ... prDes <- read.table("GSE4051_design.txt") str(prDes) ## 'data.frame': 39 obs. of 3 variables: ## $ sample : int 20 21 22 23 16 17 6 24 25 26 ... ## $ devStage: Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ... ## $ gType : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ... In the above case, note that the factor levels for devStage and gType may not be as you want. Wild type ('wt') is not the reference level and the developmental stages are not in chronological order. Set explicitly or import from a format that preserves factor levels (see below). prDes <- dget("GSE4051_design_DPUT.txt") load("GSE4051_design.robj") Both the dput/dget and the save/load approaches will leave the prDes data.frame like so: str(prDes) ## 'data.frame': 39 obs. of 3 variables: ## $ sample : num 20 21 22 23 16 17 6 24 25 26 ... ## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ... ## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ... Mini dataset: The usual problem with factor level order occurs here: > read.table("GSE4051_MINI.txt") > str(kDat) 'data.frame': 39 obs. of 6 variables: $ sample : int 20 21 22 23 16 17 6 24 25 26 ... $ devStage : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ... $ gType : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ... $ crabHammer: num 10.22 10.02 9.64 9.65 8.58 ... $ eggBomb : num 7.46 6.89 6.72 6.53 6.47 ... $ poisonFang: num 7.37 7.18 7.35 7.04 7.49 ... Importing from these R-specific formats preserves factor levels: dget("GSE4051_MINI_DPUT.txt") load("GSE4051_MINI.robj") str(kDat) 'data.frame': 39 obs. of 6 variables: $ sample : num 20 21 22 23 16 17 6 24 25 26 ... $ devStage : Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ... $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ... $ crabHammer: num 10.22 10.02 9.64 9.65 8.58 ... $ eggBomb : num 7.46 6.89 6.72 6.53 6.47 ... $ poisonFang: num 7.37 7.18 7.35 7.04 7.49 ...