Graduate Student Seminar

Asymptotically exact variational inference via measure-preserving dynamical systems

Variational inference (VI) approximates a target distribution within a chosen family that permits i.i.d. sampling and tractable density evaluation. Because the approximation is obtained by minimizing a divergence to the target, its best achievable quality is constrained by the family’s expressiveness. Yet greater flexibility does not guarantee better results: the optimization landscape is typically highly non-convex, so the theoretical optimum is rarely attained in practice. Consequently, VI generally lacks the asymptotic exactness of Markov chain Monte Carlo (MCMC)—the ability to achieve arbitrarily accurate inference given sufficient computation, regardless of tuning.
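As background (the standard textbook formulation, not anything specific to this talk), VI with target \pi and approximating family \mathcal{Q} typically solves

q^\star = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}(q \,\|\, \pi) = \arg\min_{q \in \mathcal{Q}} \mathbb{E}_{q}\!\left[ \log q(X) - \log \pi(X) \right],

so the best achievable error is the floor \min_{q \in \mathcal{Q}} \mathrm{KL}(q \,\|\, \pi) determined by the family's expressiveness, and a non-convex optimization landscape means even that floor may not be reached.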

In this talk, I will introduce mixed variational flows (MixFlows): a framework for constructing tuning-free, asymptotically exact variational families using measure-preserving dynamical systems. The key methodological advance is a way to use involutive MCMC kernels to build variational flows, yielding families that inherit MCMC-level convergence guarantees while retaining VI’s tractability (i.i.d. sampling and closed-form density evaluation).
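To make the tractability claim concrete, here is a minimal numerical sketch of an averaged flow built from a measure-preserving map. The map below is a simple volume-preserving rotation chosen only so the code runs; it is not the involutive-MCMC construction from the talk, and all names and constants are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (assumptions, not the talk's construction): an averaged
# flow q_N = (1/N) * sum_{k=0}^{N-1} T^k_# q0, where T is a volume-preserving
# map (here a 2D rotation, |det J| = 1) and q0 is a tractable reference.
# MixFlows build T from involutive MCMC kernels; the rotation below only
# illustrates why sampling and density evaluation stay tractable.

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def T(x, k=1):          # apply the map k times
    return np.linalg.matrix_power(R, k) @ x

def T_inv(x, k=1):      # apply the inverse map k times
    return np.linalg.matrix_power(R.T, k) @ x

mu0 = np.array([2.0, 0.0])
def log_q0(x):          # reference density: N(mu0, I)
    return -0.5 * np.sum((x - mu0) ** 2) - np.log(2 * np.pi)

def sample_mixflow(N, rng):
    # i.i.d. sampling: draw from q0, push forward a uniformly random number of steps
    k = rng.integers(N)
    x0 = rng.normal(size=2) + mu0
    return T(x0, k)

def log_density_mixflow(x, N):
    # exact density: average of pulled-back reference densities (|det J| = 1 here)
    logs = np.array([log_q0(T_inv(x, k)) for k in range(N)])
    return np.logaddexp.reduce(logs) - np.log(N)

rng = np.random.default_rng(0)
x = sample_mixflow(N=10, rng=rng)
print(x, log_density_mixflow(x, N=10))
```

The point of the sketch is the mechanics: i.i.d. samples come from pushing a reference draw forward a uniformly random number of steps, and the exact density is an average of pulled-back reference densities.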

I will also discuss how tools from chaotic dynamical systems illuminate the propagation of probabilistic error through inexact flows—errors that arise from finite-precision arithmetic and numerical discretization—providing practical guidance for when flow-based approximations remain reliable in spite of numerical instability.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

AutoStep: Locally adaptive involutive MCMC

Many common Markov chain Monte Carlo (MCMC) kernels can be formulated using a deterministic involutive proposal with a step size parameter. Selecting an appropriate step size is often challenging in practice, and for complex multiscale targets there may not be one choice of step size that works well globally. In this work, we address this problem with a novel class of involutive MCMC methods—AutoStep MCMC—that selects an appropriate step size at each iteration, adapted to the local geometry of the target distribution. We prove that under mild conditions AutoStep MCMC is π-invariant, irreducible, and aperiodic, and obtain bounds on the expected energy jump distance and cost per iteration. Empirical results demonstrate the robustness and efficacy of the proposed step size selection procedure, and show that AutoStep MCMC is competitive with state-of-the-art methods in terms of effective sample size per unit cost on a range of challenging target distributions.
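For context, the sketch below shows the generic involutive Metropolis-Hastings skeleton that step-size-based kernels fit into, using one leapfrog step with a momentum flip as the involution and a fixed step size. It is only an illustration of where the step size enters; it does not implement AutoStep's local step size selection, and the target and constants are assumptions.

```python
import numpy as np

# Minimal sketch of an involutive MH kernel: one leapfrog step of HMC followed by a
# momentum flip is an involution with |det Jacobian| = 1. This version uses a FIXED
# step size eps; AutoStep's contribution is to choose eps locally at each iteration
# while preserving pi-invariance, which this sketch does not attempt.

def log_pi(x):                      # assumed target: standard normal (illustrative)
    return -0.5 * np.sum(x ** 2)

def grad_log_pi(x):
    return -x

def involution(x, v, eps):
    # leapfrog step + momentum flip
    v1 = v + 0.5 * eps * grad_log_pi(x)
    x1 = x + eps * v1
    v1 = v1 + 0.5 * eps * grad_log_pi(x1)
    return x1, -v1

def involutive_mh_step(x, eps, rng):
    v = rng.normal(size=x.shape)                    # auxiliary momentum ~ N(0, I)
    x_new, v_new = involution(x, v, eps)
    log_accept = (log_pi(x_new) - 0.5 * np.sum(v_new ** 2)) \
               - (log_pi(x)     - 0.5 * np.sum(v ** 2))
    if np.log(rng.uniform()) < log_accept:
        return x_new
    return x

rng = np.random.default_rng(1)
x = np.zeros(2)
samples = []
for _ in range(1000):
    x = involutive_mh_step(x, eps=0.5, rng=rng)
    samples.append(x)
print(np.mean(samples, axis=0))
```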

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

Regularized Relative Risk Regression

The relative risk (RR) offers interpretation and comparison advantages over the odds ratio (OR) used in logistic regression. However, its direct estimation in high-dimensional settings is challenging. Common approaches, such as penalized log-binomial and Poisson regression, are built on parameters that are variationally dependent, while newer, variation-independent models have been limited by estimators not designed for high-dimensional or sparse data.

To address this gap, this project builds on previous penalized RR models to implement a faster penalized estimator for the variation-independent relative risk model. The contributions include an efficient implementation in C++, the use of an adaptive step size FISTA algorithm for robust optimization, and a comprehensive evaluation of different penalization strategies and model specifications. Through simulation studies, the proposed estimator is shown to be a robust tool for high-dimensional analysis, demonstrating improved predictive accuracy and the ability to correctly identify relevant predictors in sparse scenarios.
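As a rough illustration of the optimization machinery mentioned above, here is a sketch of FISTA with a backtracking (adaptive) step size applied to an L1-penalized least-squares problem. The least-squares loss is a stand-in; the project's actual relative-risk likelihood, penalties, and C++ implementation are not reproduced here.

```python
import numpy as np

# Sketch of FISTA with a backtracking (adaptive) step size on an L1-penalized
# least-squares problem, as a stand-in for the penalized relative-risk objective
# used in the project (the actual model, loss, and penalties may differ).

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(X, y, lam, L0=1.0, eta=2.0, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    z = beta.copy()
    t = 1.0
    L = L0
    f = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
    grad = lambda b: -X.T @ (y - X @ b)
    for _ in range(n_iter):
        g = grad(z)
        while True:     # backtracking: grow L until the quadratic model majorizes f
            beta_new = soft_threshold(z - g / L, lam / L)
            diff = beta_new - z
            if f(beta_new) <= f(z) + g @ diff + 0.5 * L * diff @ diff:
                break
            L *= eta
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)
        beta, t = beta_new, t_new
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
beta_true = np.zeros(50); beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(np.round(fista(X, y, lam=5.0)[:6], 2))
```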

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Sequential Monte Carlo-EM Algorithm for Disease Transmission Models

Estimating the parameters of disease transmission models is an important component in analyzing disease outbreaks and inferring transmission networks. The introduction of genetic data into disease transmission models has enabled more detailed inference, particularly through phylogenetic trees derived from the genetic data. Existing approaches often rely on a single phylogenetic tree to subset transmission trees from a set of possible transmission trees inferred from epidemiological data. However, such methods typically do not account for the uncertainty inherent in phylogenetic reconstruction. This thesis introduces a Sequential Monte Carlo-Expectation Maximization (SMC-EM) framework that explicitly incorporates uncertainty in transmission and phylogenetic trees. We treat these trees as latent variables and use observed genetic sequences, sampling times, and epidemiological data to inform the model. Our method constructs transmission and phylogenetic trees sequentially, conditioned on infection times, and updates parameter estimates iteratively via a variant of the EM algorithm. We evaluate the performance of the proposed method through extensive simulation studies and demonstrate its applicability using a real-world outbreak dataset. The results indicate that the SMC-EM approach provides improved parameter estimates while effectively capturing the uncertainty in latent tree structures. 
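To convey only the overall SMC-EM pattern, the sketch below runs it on a toy linear-Gaussian state-space model: a bootstrap particle filter approximates the E-step and a closed-form M-step updates a single parameter. The thesis's latent variables are transmission and phylogenetic trees rather than a scalar state, so everything below (model, noise levels, update rule) is an illustrative assumption, and the crude path resampling is known to degrade for long series.

```python
import numpy as np

# Schematic SMC-EM on a toy AR(1) state-space model: x_t = a*x_{t-1} + e_t, y_t = x_t + v_t.
# E-step: bootstrap particle filter approximating the smoothing distribution.
# M-step: closed-form update of `a` from the particle paths.

rng = np.random.default_rng(0)
T_len, a_true = 200, 0.8
x = np.zeros(T_len)
for t in range(1, T_len):
    x[t] = a_true * x[t - 1] + rng.normal()
y = x + 0.5 * rng.normal(size=T_len)

def smc_em(y, n_particles=500, n_em=20):
    a = 0.1                                      # initial parameter guess
    for _ in range(n_em):
        # --- E-step: bootstrap particle filter, storing (degenerate) ancestral paths ---
        P = rng.normal(size=n_particles)
        paths = P[None, :].copy()
        for t in range(1, len(y)):
            P = a * P + rng.normal(size=n_particles)               # propagate
            logw = -0.5 * ((y[t] - P) / 0.5) ** 2                  # weight by likelihood
            w = np.exp(logw - logw.max()); w /= w.sum()
            idx = rng.choice(n_particles, size=n_particles, p=w)   # resample
            P = P[idx]
            paths = np.vstack([paths[:, idx], P[None, :]])
        # --- M-step: maximize the expected complete-data log-likelihood over `a` ---
        num = np.mean(np.sum(paths[1:] * paths[:-1], axis=0))
        den = np.mean(np.sum(paths[:-1] ** 2, axis=0))
        a = num / den
    return a

print(round(smc_em(y), 3))
```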

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Probabilistic Modeling of High-Throughput Sequencing Data for Enhanced Understanding of DNA Methylation Heterogeneity

DNA methylation is a key epigenetic mechanism governing gene regulation and cellular identity. Advances in high-throughput sequencing technologies have enabled detailed investigation of methylation landscapes across single cells and complex tissue mixtures. However, the sparsity and noise inherent in single-cell data, as well as the signal distortion in enrichment-based platforms, pose major analytical challenges. This thesis presents two novel statistical frameworks to address these limitations and advance the computational toolkit for DNA methylation analysis.

The first contribution is vmrseq, a probabilistic method and software for detecting variably methylated regions from single-cell bisulfite sequencing data. vmrseq integrates a smoothing-based strategy for candidate region identification with hidden Markov modeling to account for spatial correlation and technical noise. Through extensive benchmarking on synthetic and experimental datasets, vmrseq demonstrates improved precision and biological relevance in identifying methylation heterogeneity, supporting downstream analyses such as unsupervised clustering and cell-type-specific marker discovery.
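As a toy illustration of the role an HMM plays here (not vmrseq's actual model or parameterization), the sketch below runs the forward algorithm for a two-state HMM along CpG sites with binomial read-count emissions; the sticky transition matrix is what lets information pool across neighbouring sites despite sparse coverage.

```python
import numpy as np
from scipy.stats import binom

# Toy two-state HMM along CpG sites: hidden states are "lowly" vs "highly" methylated,
# emissions are binomial counts of methylated reads. All parameters are placeholders.

log_trans = np.log(np.array([[0.95, 0.05],
                             [0.05, 0.95]]))      # sticky transitions = spatial smoothness
p_meth = np.array([0.1, 0.9])                      # per-state methylation probabilities

def forward_loglik(meth_counts, total_reads):
    log_alpha = np.log([0.5, 0.5]) + binom.logpmf(meth_counts[0], total_reads[0], p_meth)
    for c, n in zip(meth_counts[1:], total_reads[1:]):
        log_emit = binom.logpmf(c, n, p_meth)
        # log_alpha[j] = log sum_i alpha[i] * trans[i, j] * emit[j]
        log_alpha = log_emit + np.logaddexp.reduce(log_alpha[:, None] + log_trans, axis=0)
    return np.logaddexp.reduce(log_alpha)

meth = np.array([0, 1, 0, 8, 9, 10, 7])
tot  = np.array([5, 6, 4, 9, 10, 11, 8])
print(forward_loglik(meth, tot))
```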

The second contribution is decemedip, a hierarchical Bayesian model and software for cell type deconvolution of enrichment-based methylation data such as MeDIP-seq. By leveraging reference panels derived from alternative platforms and modeling the complex relationship between methylation levels, CpG density, and read counts, decemedip enables accurate estimation of cell type proportions with uncertainty quantification. Its performance is validated through simulations, cross-platform comparisons, and real-world applications involving patient-derived xenografts and circulating cell-free DNA from cancer cohorts.
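For readers unfamiliar with reference-based deconvolution, the bare-bones version looks like the sketch below: nonnegative least squares against a reference panel, followed by renormalization. decemedip's hierarchical Bayesian model additionally handles MeDIP-seq read counts, CpG density, and uncertainty quantification, none of which appear in this simplified example; the data are simulated placeholders.

```python
import numpy as np
from scipy.optimize import nnls

# Bare-bones reference-based deconvolution: given a reference matrix R (regions x cell
# types) of methylation levels and a bulk profile y, estimate cell-type proportions by
# nonnegative least squares and renormalize onto the simplex.

rng = np.random.default_rng(0)
n_regions, n_types = 300, 4
R = rng.uniform(size=(n_regions, n_types))          # hypothetical reference panel
w_true = np.array([0.5, 0.3, 0.15, 0.05])           # true mixing proportions
y = R @ w_true + 0.02 * rng.normal(size=n_regions)  # noisy bulk methylation profile

w_hat, _ = nnls(R, y)
w_hat /= w_hat.sum()                                 # renormalize to proportions
print(np.round(w_hat, 3))
```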

Together, these methods address critical gaps in the analysis of high-throughput DNA methylation data, enabling robust detection of epigenetic heterogeneity across biological contexts. The associated open-source software implementations provide practical tools for future epigenomic research and potential clinical applications.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Modelling Complex Biologging Data with Hidden Markov Models

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

Abstract:  Hidden Markov models (HMMs) are commonly used to identify latent processes from observed time series, but it is challenging to fit them to large and complex time series collected by modern sensors. Using data from threatened resident killer whales (Orcinus orca) off the western coast of Canada as a case study, we provide solutions to three common challenges faced when identifying latent behaviour from complicated biologging data. First, biologging time series often violate common assumptions of HMMs when collected at high frequencies. We thus propose a hierarchical approach which utilizes moving-window Fourier analysis to capture fine-scale dependence structures. Second, modern technology allows researchers to directly label the latent process of interest, but rare labels can have a negligible influence on parameter estimates. We introduce a weighted likelihood approach that increases the relative influence of labelled observations. Third, applying HMMs to large time series is computationally demanding, so we propose a novel EM algorithm that combines a partial E step with variance-reduced stochastic optimization within the M step. These solutions allow researchers to model biologging data with HMMs that are more interpretable, accurate, and efficient to fit than existing methods.
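As a small, self-contained example of the moving-window Fourier idea in the first contribution (window length, features, and data below are placeholder assumptions, not the paper's choices), each window of a high-frequency channel is summarized by spectral features that can then serve as the observation sequence of an HMM:

```python
import numpy as np

# Divide a high-frequency sensor channel into windows and summarize each window by
# spectral features (here: dominant frequency and total power). These window-level
# features, rather than the raw samples, become the HMM observations.

def window_fourier_features(signal, fs, window_len):
    n_windows = len(signal) // window_len
    feats = []
    for w in range(n_windows):
        seg = signal[w * window_len:(w + 1) * window_len]
        spec = np.abs(np.fft.rfft(seg - seg.mean())) ** 2        # power spectrum
        freqs = np.fft.rfftfreq(window_len, d=1.0 / fs)
        feats.append([freqs[np.argmax(spec)], spec.sum()])       # dominant freq, total power
    return np.array(feats)

fs = 50.0                                    # e.g., a 50 Hz accelerometer channel
t = np.arange(0, 60, 1.0 / fs)
signal = np.sin(2 * np.pi * 2.0 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)
print(window_fourier_features(signal, fs, window_len=256)[:3])
```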

Two MSc student presentations (Charlotte Edgar & Graeme Kempf)

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

Presentation 1

Time: 11:00am – 11:30am

Speaker: Charlotte Edgar, UBC Statistics MSc student

Title: Cellwise Robust Covariance-Regularized Regression for High-Dimensional Data

Abstract: It is common to use regularization methods when dealing with high-dimensional regression problems. The scout family, developed by Witten and Tibshirani in 2009, is a class of covariance-regularized regression procedures suitable for prediction in high-dimensional settings. The scout procedure estimates the inverse covariance matrix through two log-likelihood maximization steps, each of which allows for regularization, and then uses the estimated inverse covariance matrix to obtain estimates of the regression coefficients. The aim of this project was to make the scout procedure robust to cellwise outliers. Cellwise outliers are common in high-dimensional datasets, and recent work has led to cellwise robust covariance estimates that could be used in the scout procedure. We assess the predictive performance of robust plug-in estimators and outlier detection methods. A regression method that is robust to cellwise outliers, encourages sparsity, and can be applied in high-dimensional settings would be valuable to many fields, such as genomics, and is an area of ongoing research.
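A stripped-down sketch of the covariance-regularized plug-in idea follows: estimate a regularized inverse covariance of the predictors (here with the graphical lasso) and plug it into the least-squares formula for the coefficients. This omits the scout family's second penalized step and rescaling, and the cellwise-robust variants studied in the project would replace the covariance estimate; the data and tuning values are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Simplified covariance-regularized plug-in (not the exact Witten-Tibshirani scout
# estimator): regularize the inverse covariance of the predictors with the graphical
# lasso, then plug it into beta = Sigma_xx^{-1} Sigma_xy. A cellwise-robust version
# would swap in a robust covariance estimate before this step.

rng = np.random.default_rng(0)
n, p = 80, 40
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = [1.5, -1.0, 0.5]
y = X @ beta_true + rng.normal(size=n)

Xc, yc = X - X.mean(0), y - y.mean()
theta = GraphicalLasso(alpha=0.1).fit(Xc).precision_     # regularized inverse covariance
beta_hat = theta @ (Xc.T @ yc / n)                        # covariance-regularized coefficients
print(np.round(beta_hat[:6], 2))
```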

Presentation 2

Time: 11:30am – 12:00pm

Speaker: Graeme Kempf, UBC Statistics MSc student

Title: The impact of disease-modifying drugs for multiple sclerosis on hospitalizations and mortality in British Columbia: A retrospective study using an illness-death multi-state model

Abstract: The efficacy of disease-modifying drugs (DMDs) for multiple sclerosis was established in clinical trials that were short and excluded older individuals and individuals living with comorbidities. This has led to a lack of knowledge about the effects of chronic DMD use and the effects of DMDs on individuals who do not meet the traditional eligibility criteria for clinical trials. Multi-state models are a technique that can advance the understanding of a disease beyond that offered by time-to-event models alone. The long-term, real-world efficacy of DMDs was explored by applying a multi-state model to administrative healthcare data. Multi-state techniques such as intensity-based analysis and pseudo-value regression were used to investigate whether exposure to any DMD is associated with fewer hospitalizations, shorter hospitalizations, and/or a reduced chance of dying inside or outside the hospital.
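For readers unfamiliar with the model class, a generic illness-death specification with states 1 (alive, out of hospital), 2 (hospitalized), and 3 (dead) uses transition intensities such as

\lambda_{rs}(t \mid z) = \lambda_{rs,0}(t)\, \exp(z^\top \beta_{rs}), \qquad (r,s) \in \{(1,2),\, (2,1),\, (1,3),\, (2,3)\},

where DMD exposure enters through the covariate vector z. The exact state space and parameterization in this thesis (for example, distinguishing death inside versus outside the hospital) may differ; the display above is only a schematic of the model class.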
