Graduate Student Seminar

Probabilistic Modeling of High-Throughput Sequencing Data for Enhanced Understanding of DNA Methylation Heterogeneity

DNA methylation is a key epigenetic mechanism governing gene regulation and cellular identity. Advances in high-throughput sequencing technologies have enabled detailed investigation of methylation landscapes across single cells and complex tissue mixtures. However, the sparsity and noise inherent in single-cell data, as well as the signal distortion in enrichment-based platforms, pose major analytical challenges. This thesis presents two novel statistical frameworks to address these limitations and advance the computational toolkit for DNA methylation analysis.

The first contribution is vmrseq, a probabilistic method and software for detecting variably methylated regions from single-cell bisulfite sequencing data. vmrseq integrates a smoothing-based strategy for candidate region identification with hidden Markov modeling to account for spatial correlation and technical noise. Through extensive benchmarking on synthetic and experimental datasets, vmrseq demonstrates improved precision and biological relevance in identifying methylation heterogeneity, supporting downstream analyses such as unsupervised clustering and cell-type-specific marker discovery.

The second contribution is decemedip, a hierarchical Bayesian model and software for cell type deconvolution of enrichment-based methylation data such as MeDIP-seq. By leveraging reference panels derived from alternative platforms and modeling the complex relationship between methylation levels, CpG density, and read counts, decemedip enables accurate estimation of cell type proportions with uncertainty quantification. Its performance is validated through simulations, cross-platform comparisons, and real-world applications involving patient-derived xenografts and circulating cell-free DNA from cancer cohorts.

Together, these methods address critical gaps in the analysis of high-throughput DNA methylation data, enabling robust detection of epigenetic heterogeneity across biological contexts. The associated open-source software implementations provide practical tools for future epigenomic research and potential clinical applications.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

Modelling Complex Biologging Data with Hidden Markov Models

Abstract: Hidden Markov models (HMMs) are commonly used to identify latent processes from observed time series, but they are challenging to fit to the large and complex time series collected by modern sensors. Using data from threatened resident killer whales (Orcinus orca) off the western coast of Canada as a case study, we provide solutions to three common challenges faced when identifying latent behaviour from complicated biologging data. First, biologging time series collected at high frequencies often violate common assumptions of HMMs. We thus propose a hierarchical approach that uses moving-window Fourier analysis to capture fine-scale dependence structures. Second, modern technology allows researchers to directly label the latent process of interest, but rare labels can have a negligible influence on parameter estimates. We introduce a weighted likelihood approach that increases the relative influence of labelled observations. Third, applying HMMs to large time series is computationally demanding, so we propose a novel EM algorithm that combines a partial E step with variance-reduced stochastic optimization within the M step. These solutions allow researchers to model biologging data with HMMs that are more interpretable, accurate, and efficient to fit than existing methods.
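To make the second point concrete, the following is a minimal sketch of how labelled observations might enter an HMM likelihood. It uses the standard forward algorithm in log space for a Gaussian HMM. The function names, the example parameters, and in particular the way `weight` is applied (multiplying the log emission density at labelled time points) are illustrative assumptions, not the speaker's formulation.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def normal_logpdf(y, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((y - mean) / sd) ** 2

def weighted_hmm_loglik(y, labels, means, sds, trans, init, weight=1.0):
    """Forward-algorithm log-likelihood for a Gaussian HMM.

    labels[t] is a state index when time t is labelled, or None otherwise.
    A labelled time point is restricted to its labelled state, and its log
    emission density is multiplied by `weight` (a hypothetical weighting).
    """
    n_states = len(means)

    def log_emission(t, k):
        lp = normal_logpdf(y[t], means[k], sds[k])
        if labels[t] is None:
            return lp
        return weight * lp if labels[t] == k else -math.inf

    # Initialise with the initial distribution, then recurse over time.
    log_alpha = [math.log(init[k]) + log_emission(0, k) for k in range(n_states)]
    for t in range(1, len(y)):
        log_alpha = [
            log_sum_exp([log_alpha[j] + math.log(trans[j][k]) for j in range(n_states)])
            + log_emission(t, k)
            for k in range(n_states)
        ]
    return log_sum_exp(log_alpha)

# Hypothetical two-state example with one labelled observation.
y = [0.1, 2.9, 3.2, -0.2]
labels = [None, 1, None, None]
means, sds = [0.0, 3.0], [1.0, 1.0]
trans = [[0.9, 0.1], [0.2, 0.8]]
init = [0.5, 0.5]

unlabelled = weighted_hmm_loglik(y, [None] * 4, means, sds, trans, init)
labelled = weighted_hmm_loglik(y, labels, means, sds, trans, init, weight=1.0)
```

With `weight=1.0` and no labels this reduces to the ordinary HMM log-likelihood. Under this assumed scheme, a larger `weight` makes each labelled time point count more heavily in the objective when parameters are estimated.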

Two MSc student presentations (Charlotte Edgar & Graeme Kempf)

Presentation 1

Time: 11:00am – 11:30am

Speaker: Charlotte Edgar, UBC Statistics MSc student

Title: Cellwise Robust Covariance-Regularized Regression for High-Dimensional Data

Abstract: It is common to use regularization methods when dealing with high-dimensional regression problems. The scout family, developed by Witten and Tibshirani in 2009, is a class of covariance-regularized regression procedures suitable for prediction in high-dimensional settings. The scout procedure estimates the inverse covariance matrix through two log-likelihood maximization steps, each of which allows for regularization, and then uses the estimated inverse covariance matrix to obtain estimates of the regression coefficients. The aim of this project was to make the scout procedure robust to cellwise outliers. Cellwise outliers are common in high-dimensional datasets, and recent work has led to cellwise robust covariance estimates that could be used in the scout procedure. We assess the predictive performance of robust plug-in estimators and outlier detection methods. A regression method that is robust to cellwise outliers, encourages sparsity, and can be applied in high-dimensional settings would be valuable to many fields, such as genomics, and is an area of active research.
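The plug-in idea mentioned in the abstract can be illustrated with a short sketch: regression coefficients are computed from a covariance estimate of the predictors and response, so a robust covariance estimate could be substituted for the sample covariance. The code below is not the scout procedure itself. It uses a simple ridge-type adjustment as a stand-in for covariance regularization, and all function names and data are hypothetical.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (list of tuples)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def plugin_coefficients(joint_cov, ridge=0.0):
    """Coefficients from a joint covariance of (X, y); y is the last variable.

    Any covariance estimate can be passed in, for example a cellwise robust
    one. `ridge` adds a constant to the diagonal of the predictor block.
    """
    p = len(joint_cov) - 1
    s_xx = [[joint_cov[i][j] + (ridge if i == j else 0.0) for j in range(p)]
            for i in range(p)]
    s_xy = [joint_cov[i][p] for i in range(p)]
    return solve(s_xx, s_xy)

# Hypothetical noise-free data with y = 2 * x1 - x2.
rows = [(1, 2, 0), (2, 1, 3), (3, 5, 1), (4, 3, 5), (5, 4, 6)]
joint_cov = covariance_matrix(rows)
beta_hat = plugin_coefficients(joint_cov)
```

With the sample covariance and `ridge=0.0` this reproduces ordinary least squares on centred data. Replacing `covariance_matrix` with a robust estimator changes only the input to `plugin_coefficients`.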

Presentation 2

Time: 11:30am – 12:00pm

Speaker: Graeme Kempf, UBC Statistics MSc student

Title: The impact of disease-modifying drugs for multiple sclerosis on hospitalizations and mortality in British Columbia: A retrospective study using an illness-death multi-state model

Abstract: The efficacy of disease-modifying drugs (DMDs) for multiple sclerosis was established in clinical trials that were short and excluded older individuals and individuals living with comorbidities. As a result, little is known about the effects of chronic DMD use or about the effects of DMDs on individuals who do not meet the traditional eligibility criteria for clinical trials. Multi-state modelling is a technique that can advance the understanding of a disease beyond that offered by time-to-event models alone. The long-term, real-world efficacy of DMDs was explored by applying a multi-state model to administrative healthcare data. Multi-state techniques such as intensity-based analysis and pseudo-value regression were used to investigate whether exposure to any DMD is associated with fewer hospitalizations, shorter hospitalizations, and/or a reduction in the chance of dying inside or outside the hospital.
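As a rough illustration of what a multi-state model adds over a single time-to-event model, the sketch below computes transition probabilities for a three-state process with states "out of hospital", "in hospital", and "dead". It assumes a time-homogeneous Markov process with constant transition intensities, so that P(t) = exp(Qt). The state layout and the rates are hypothetical, chosen for illustration only, and are not taken from the study.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probabilities(q, t, squarings=10, terms=20):
    """P(t) = exp(Q t) via a truncated Taylor series with scaling and squaring."""
    n = len(q)
    scaled = [[q[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds scaled**k / k! after this step.
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical intensities per year; each row of Q sums to zero.
# States: 0 = out of hospital, 1 = in hospital, 2 = dead (absorbing).
intensity = [
    [-0.32, 0.30, 0.02],
    [5.00, -5.10, 0.10],
    [0.00, 0.00, 0.00],
]

one_year = transition_probabilities(intensity, 1.0)
two_years = transition_probabilities(intensity, 2.0)
```

Entry `one_year[i][j]` is the probability of being in state `j` one year after starting in state `i`. In an intensity-based analysis, covariates such as DMD exposure would act on the individual entries of the intensity matrix, which this sketch holds fixed.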