To join this seminar virtually: Please request Zoom connection details from ea [at] stat.ubc.ca
Abstract: With the ubiquity of data, linking data sets has become crucial for myriad applications including healthcare, official statistics, ecology, fraud detection, and national security. Record linkage is the task of resolving duplicates in two or more partially overlapping sets of records, or files, drawn from noisy data sources without a unique identifier. In any field where multiple sources of messy data are available to address a scientific problem, record linkage is a critical step in the analysis pipeline. In streaming record linkage, files arrive sequentially in time and estimates of the linkage structure are updated after the arrival of each file. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. In this talk, I present the first multi-file Bayesian record linkage model formulated specifically for the streaming data context. The model is fit using recursive updates, incorporating each new batch of data into the posterior distribution of the model parameters. A novel Markov chain Monte Carlo algorithm is presented that performs recursive Bayesian updates while avoiding the degradation issue common to many recursive algorithms. This sampler achieves near-equivalent posterior inference to non-streaming algorithms at a small fraction of the compute time.
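The abstract's recursive-update idea (yesterday's posterior becomes today's prior as each new file arrives) can be illustrated with a toy conjugate Beta-Bernoulli sketch. This is purely illustrative and is not the speaker's model, which links records across files with MCMC; all names below are hypothetical:

```python
import random

def update_beta(alpha, beta, batch):
    """Recursive Bayesian update for a Beta-Bernoulli model:
    the posterior after one batch becomes the prior for the next."""
    successes = sum(batch)
    return alpha + successes, beta + len(batch) - successes

random.seed(0)
alpha, beta = 1.0, 1.0  # flat Beta(1, 1) prior before any data arrive

# Four "files" arriving sequentially, each a batch of 50 binary match indicators
batches = [[1 if random.random() < 0.7 else 0 for _ in range(50)]
           for _ in range(4)]

for batch in batches:          # streaming: update after each file
    alpha, beta = update_beta(alpha, beta, batch)

posterior_mean = alpha / (alpha + beta)

# Sanity check: for a conjugate model, the recursive (streaming) fit
# matches a single non-streaming fit on the pooled data exactly.
pooled = [x for b in batches for x in b]
assert update_beta(1.0, 1.0, pooled) == (alpha, beta)
```

In conjugate settings like this one, recursion is exact; the talk's contribution is an MCMC sampler that achieves comparable recursive updating for a model where no closed-form posterior is available, without the degradation that typically accumulates over repeated recursive approximations.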