Registration & talk details
Date: Thursday, April 20, 2023
Time: 5:00 PM - 9:00 PM (Pacific Time)
5:00 - 5:45 PM : Meet-The-Speaker *
5:45 - 6:00 PM : Break
6:00 - 6:05 PM : Seminar begins / Announcements
6:05 - 6:25 PM : Presentation by Trainee Speaker
6:25 - 7:20 PM : Presentation by Featured Speaker
7:20 - 9:00 PM : Seminar ends / Networking + light refreshments
* Depending on the number of RSVPs
Please fill out the RSVP form if you are interested in attending this seminar in person (or the meet-the-speaker session before the seminar at 5:00 PM).
Talk Title: Learning deep biology with shallow statistical models
Abstract: As deep learning approaches have gained popularity, typical statistical learning methods have been replaced by deep generative models and black-box classification algorithms in many types of biological data analysis, including genetic variant calling, regulatory genomics, high-dimensional data embedding, missing value imputation, and risk prediction. Despite being readily accessible through general-purpose libraries, so-called deep methods demand long hours of training, specialized hardware resources, and a large amount of data; yet we are often startled to see only marginally improved classification performance, model overfitting, or a lack of generalizability. Without undermining the important advancements made possible by deep models, this talk will seek to showcase that we can deepen our understanding of biological systems with shallow statistical models.
First, I will discuss our scalable algorithm for probabilistic topic modelling in single-cell genomics data. Based on probabilistic topic assignments in each cell, we identify the latent representation of cellular states and heterogeneity, and the latent topic vectors often yield much better clustering results than other types of dimensionality reduction methods. We were initially inspired by several empirical observations: (1) Data sets compressed by repeatedly applying random projection operations highlight cell type-specific signature genes. (2) Rare cell types are better characterized by genes that are lowly expressed across the whole data set; such genes are typically removed in quality control steps and are thus not included in embedding model estimation. (3) Bulk sequencing data sets are generally less prone to zero inflation or measurement errors. Based on these key observations, we designed a statistical framework termed ASAP (short for Annotating Single-cell data matrix by Approximate Pseudo-bulk projection) to identify cell topics. ASAP seeks to reduce sample size by interactively collapsing cell-level expression vectors into pseudo-bulk vectors in order to accurately perform non-negative matrix factorization to learn topic-specific gene frequency patterns.
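As a rough illustration of the pseudo-bulk idea described above, the following minimal sketch collapses cells into pseudo-bulk vectors and then factorizes the result. It is not the ASAP implementation: grouping cells by the sign pattern of a few Gaussian random projections, the multiplicative-update NMF, and the names pseudo_bulk and nmf are all assumptions made for this example.

```python
import numpy as np

def pseudo_bulk(counts, n_proj=4, seed=0):
    """Collapse cells (columns) that share a random-projection sign code.

    Assumption: cells are binned by the signs of `n_proj` Gaussian random
    projections of centred log counts; this is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = counts.shape
    logx = np.log1p(counts)
    logx = logx - logx.mean(axis=1, keepdims=True)        # centre each gene
    proj = rng.standard_normal((n_proj, n_genes)) @ logx  # n_proj x n_cells
    bits = (proj > 0).astype(int)
    codes = bits.T @ (1 << np.arange(n_proj))             # one integer code per cell
    labels, inverse = np.unique(codes, return_inverse=True)
    # Cell-by-bin membership matrix; summing counts within each bin
    member = (inverse[:, None] == np.arange(labels.size)[None, :]).astype(float)
    return counts @ member, labels                        # genes x pseudo-bulk samples

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Non-negative factorization v ~ w @ h via multiplicative updates."""
    rng = np.random.default_rng(seed)
    w = rng.random((v.shape[0], n_topics)) + eps
    h = rng.random((n_topics, v.shape[1])) + eps
    for _ in range(n_iter):
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    scale = w.sum(axis=0)
    # Rescale so each column of w is a gene-frequency vector summing to 1
    return w / scale, h * scale[:, None]

# Toy usage on simulated counts (genes x cells)
toy = np.random.default_rng(1).poisson(1.0, size=(50, 200))
bulk, _ = pseudo_bulk(toy)
topic_gene_freq, loadings = nmf(bulk, n_topics=3)
```

Each column of topic_gene_freq sums to one, so it can be read as a topic-specific gene frequency vector. The factorization runs on at most 2**n_proj pseudo-bulk columns rather than on every cell, which is where the reduction in sample size comes from in this sketch.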
Another example will be a sparse regression model with a causal inference flavour that can effectively handle putative confounding issues in a genetic fine-mapping problem. Our idea is rooted in Rubin's causal inference framework (Rosenbaum and Rubin, 1983), with which non-genetic and indirect genetic effects can be cancelled out in implicit adjustment steps. In order to handle the non-binary nature of exposure variables (genetic dosage), our method first matches individuals with one another based on a covariate similarity matrix. If paired individuals share non-genetic factors, then the gene expression changes between them that are due to those shared factors will be removed, so that genetic dosage becomes the causal factor in the remaining expression divergence. We implemented the algorithm in the SuSiE framework (Wang et al. 2020).
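The matching step can be illustrated with a minimal sketch on simulated data. This is not the speaker's method or the SuSiE implementation: nearest-neighbour matching on Euclidean distance, a single-variant least-squares slope in place of sparse regression, and the names nearest_partner, matched_difference_effect and simulate are all assumptions made for this example.

```python
import numpy as np

def nearest_partner(covariates):
    """Index of each individual's closest other individual in covariate space."""
    diff = covariates[:, None, :] - covariates[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)        # never match an individual to itself
    return dist.argmin(axis=1)

def matched_difference_effect(covariates, dosage, expression):
    """Slope of paired expression differences on paired dosage differences."""
    partner = nearest_partner(covariates)
    d_dosage = dosage - dosage[partner]
    d_expr = expression - expression[partner]
    return float(d_dosage @ d_expr / (d_dosage @ d_dosage))

def simulate(n_pairs=100, beta=0.5, seed=1):
    """Toy data (hypothetical): each covariate profile is shared by two people."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((n_pairs, 3))
    covariates = np.repeat(base, 2, axis=0)
    confounder = covariates @ np.array([1.0, -1.0, 0.5])
    prob = 1.0 / (1.0 + np.exp(-confounder))   # the confounder also shifts dosage
    dosage = rng.binomial(2, prob).astype(float)
    expression = beta * dosage + 3.0 * confounder
    return covariates, dosage, expression

covariates, dosage, expression = simulate()
effect = matched_difference_effect(covariates, dosage, expression)
```

In this toy setting the matched individuals have identical covariates, so the confounder term cancels in every paired difference and the slope recovers the simulated effect of dosage on expression. With real data the matches are only approximate, so the cancellation would be approximate as well.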
Overall, this talk will reassure us that it is okay to work on a shallow model. In fact, if our scientific goal is straightforward enough to be coded in a simple model armed with intuitive algorithms, such a modelling approach can help uncover deeper layers of biological mechanisms underneath high-dimensional genomics data.