The quantity and variety of genomics datasets have increased tremendously in the last decade, presenting novel opportunities both for deriving cellular pathways and networks and for identifying the genetic and cellular mechanisms that underlie disease. However, interpreting these data to extract biological insights requires disentangling meaningful (and hence reproducible and consequential) associations from mere correlations (i.e., spurious associations). In this talk, I will present computational and machine learning approaches that leverage prior biological knowledge while integrating heterogeneous data in order to find robust associations. First, I will describe a scalable method for graph-based integration of diverse types of genomics data, which accurately infers functional roles for uncharacterized genes based on a small set of known (training) genes. This approach provides a state-of-the-art means of automatically leveraging the continuous production of new genomics data. Second, focusing on the task of finding associations between genetic variation and cellular (expression) traits in a population-based study, I will present methods that use known confounding factors to infer and account for hidden confounding factors. Third, expanding this task to the context of a specific disease, I will describe a project that combines genotype, RNA-sequencing, and environmental data to find genes and pathways that correlate with disease status. Applying this approach to a large case/control study of major depression, a highly confounded disorder, sheds new light on the molecular mechanisms associated with this pathology.
Integrating multiple types of genomics data to disentangle meaningful associations
Tuesday, January 14, 2014 - 11:00
Sara Mostafavi, Post Doc in the DAGS Group, Stanford University
Room 4192, Earth Sciences Building (2207 Main Mall)