A Data-Driven Ensemble Framework for Modeling High-Dimensional Data: Theory, Methods, Algorithms and Applications

Tuesday, August 23, 2022 - 11:00 to 12:00
Anthony-Alexander Christidis, UBC Statistics PhD Student
Zoom / ESB 4192

To Join Via Zoom: To join this seminar virtually, please request Zoom connection details from headsec [at] stat.ubc.ca.

Abstract: Sparse and ensemble methods are the two main approaches in the statistical literature for modeling high-dimensional data. On the one hand, sparse methods yield a single predictive model that is generally interpretable and possesses desirable theoretical properties. On the other hand, multi-model ensemble methods can generally achieve superior prediction accuracy, but current ensemble methodology relies on randomization or boosting to generate diverse models, which results in uninterpretable ensembles. The diverse models generated by these “black box” algorithms are not insightful on their own and are only useful when they are pooled together.

In this dissertation, we introduce a new data-driven ensemble framework that combines ideas from sparse modeling and ensemble modeling. We search for optimal ways to select and split the candidate predictors into subsets for the different models that will be combined in an ensemble. Each model in the ensemble provides an alternative explanation for the relationship between the predictor variables and the response variable of interest. The degree of sparsity of the individual models and the diversity among the models are both driven by the data. The task of optimally splitting the candidate predictors into subsets results in a computationally intractable combinatorial optimization problem when the number of predictors is large. To demonstrate the potential of an exhaustive search for the optimal split of the predictors into the different models of an ensemble, we test our new approach on specifically designed low-dimensional data that mimic the typical behavior of high-dimensional data, such as a low signal-to-noise ratio and the presence of spurious correlations.
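As a rough illustration of why the exhaustive search is only feasible in low dimensions, the following is a minimal Python sketch, not the method presented in the dissertation. It assumes least-squares models, an ensemble prediction formed by averaging, and validation mean squared error as the selection criterion; none of these choices are stated in the abstract, and the function names `fit_predict` and `exhaustive_split_search` are hypothetical.

```python
import itertools

import numpy as np


def fit_predict(X_train, y_train, X_test, cols):
    """Fit least squares (with intercept) on the chosen columns; predict on X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.column_stack([np.ones(len(X_test)), X_test[:, cols]])
    return B @ coef


def exhaustive_split_search(X_train, y_train, X_val, y_val, n_models=2):
    """Try every assignment of predictors to disjoint models (or to no model).

    Assumed criterion: validation MSE of the averaged predictions.
    The loop visits (n_models + 1) ** p assignments, so it is only
    practical when the number of predictors p is small.
    """
    p = X_train.shape[1]
    best_mse, best_split = np.inf, None
    # Label 0 means "predictor unused"; labels 1..n_models pick a model.
    for labels in itertools.product(range(n_models + 1), repeat=p):
        groups = [
            [j for j in range(p) if labels[j] == g]
            for g in range(1, n_models + 1)
        ]
        if any(len(g) == 0 for g in groups):
            continue  # every model needs at least one predictor
        preds = np.mean(
            [fit_predict(X_train, y_train, X_val, g) for g in groups], axis=0
        )
        mse = float(np.mean((y_val - preds) ** 2))
        if mse < best_mse:
            best_mse, best_split = mse, groups
    return best_split, best_mse
```

With two models and 20 predictors, this loop would already visit 3^20 (roughly 3.5 billion) assignments, which illustrates why an exhaustive search is restricted to low-dimensional data.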

We then propose different computational approaches to the optimal split selection problem. We first introduce a multi-convex relaxation in the regression case and develop efficient algorithms to compute solutions for any level of sparsity and diversity. We show that the resulting ensembles yield consistent predictions and consistent individual models, and provide empirical evidence that this method outperforms state-of-the-art sparse and ensemble methods for high-dimensional prediction tasks using simulated data and a chemometrics application. We then extend the methodology, theory and algorithms to classification ensembles, and investigate the performance of the method on simulated data and a large collection of gene expression datasets. Finally, we propose a direct computational approach to calculate approximate solutions to the optimal split selection problem in the regression case and benchmark the performance of the method against the multi-convex relaxation on simulated and gene expression data.

Efficient software libraries with the implementations of the new computational methods provide researchers with tools to (1) achieve state-of-the-art accuracy for high-dimensional prediction tasks, and (2) aid in the scientific discovery of different mechanisms underlying the relationship between a large number of candidate predictors and the response variable of interest.