Scaling Bayesian Record Linkage for Streaming Data Contexts

Tuesday, January 28, 2025 - 10:30 to 11:30
Andee Kaplan, Assistant Professor, Department of Statistics, Colorado State University
Statistics Seminar
ESB 4192 / Zoom

To join this seminar virtually, please request Zoom connection details from ea [at] stat.ubc.ca.

Abstract: With the ubiquity of data, linking data sets has become crucial for myriad applications including healthcare, official statistics, ecology, fraud detection, and national security. Record linkage is the task of resolving duplicates in two or more partially overlapping sets of records, or files, from noisy data sources without a unique identifier. In any field where multiple sources of messy data are available to address a scientific problem, record linkage is a critical step in the analysis pipeline. In streaming record linkage, files arrive sequentially in time and estimates of the linkage structure are updated after the arrival of each file. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. In this talk, I present the first multi-file Bayesian record linkage model formulated specifically for the streaming data context. This model is fit using recursive updates, incorporating each new batch of data into the model parameters' posterior distribution. A novel Markov chain Monte Carlo algorithm is presented that performs recursive Bayesian updates while avoiding the issue of degradation, common to many recursive algorithms. This sampler achieves near-equivalent posterior inference to non-streaming algorithms at a small fraction of the compute time.