MSc students' Co-op Presentations

Tuesday, March 12, 2024 - 10:30 to 12:30
Clayton Allard, UBC Statistics MSc student; Xiao (Nicole) Hu, UBC Statistics MSc student; Yicheng Wang, UBC Statistics MSc student; Yixin Zhang, UBC Statistics MSc student
ESB 4192 / Zoom

To join this seminar virtually: Please request Zoom connection details from ea [at] stat.ubc.ca.

Presentation 1

Time: 10:30am – 11:00am

Speaker: Clayton Allard, UBC Statistics MSc student

Title: 2023 co-op term

Abstract: For my 8-month co-op, I worked as a data scientist for the US Department of Defense. I worked with a collection of cybersecurity data referred to as the "Activity Store". The Activity Store collects data on suspicious activity within a customer's network, which is then flagged as an alert for review. The problem is that there are far too many alerts to review. My main project was to create a statistical model to predict whether each alert corresponds to malicious or benign activity. The objective was to prioritize for review the alerts that are more likely to reflect malicious activity, and to determine how the alert system can be modified to reduce the traffic entering the Activity Store. In the end, we used a random forest, which gave significantly better results than the manual rule-based approach that was previously implemented.

Presentation 2

Time: 11:00am – 11:30am

Speaker: Xiao (Nicole) Hu, UBC Statistics MSc student

Abstract: I worked as a data analyst at Staples Lab within the UBC Department of Medicine for an 8-month co-op term. My primary project examined the risk of illicit drug overdose following ‘before medically advised’ (BMA) discharge. BMA discharge, or patient-initiated departure, is more common among people with problematic drug use and has been reported to increase the risk of death, yet its relationship to subsequent illicit drug overdose remains uncertain. We conducted a cohort analysis and a case-crossover analysis of population-based linked administrative health data to examine the risk of subsequent illicit drug overdose. In the cohort study, we performed a retrospective analysis using administrative health data from a 20% random sample of the population of British Columbia, Canada. We focused on non-elective, non-obstetrical hospitalizations occurring between 2015 and 2019. We used survival analysis to compare the risk of drug overdose in the first 30 days after BMA discharge to the risk of drug overdose after routine physician-advised discharge. In the case-crossover study, we focused on individuals experiencing an overdose between 2016 and 2019 in British Columbia, Canada. We used conditional logistic regression to compare the likelihood of hospital discharge in the 28 days prior to overdose (the 'pre-overdose interval') to the likelihood of hospital discharge in two self-matched 28-day control intervals ending 26 and 52 weeks prior to overdose. The primary analysis evaluated whether BMA discharge was associated with an increased risk of subsequent drug overdose. A secondary analysis evaluated the association between physician-advised discharge and subsequent drug overdose.

Presentation 3

Time: 11:30am – 12:00pm

Speaker: Yicheng Wang, UBC Statistics MSc student

Title: Data Pipeline Infrastructure Before Any Analytics Models

Abstract: In industry, studies show that data scientists and analysts spend 60% of their time cleaning and organizing data. Data quality has been found to be the key to modelling success. This talk will focus on the infrastructure of end-to-end data pipelines. During my co-op at Samsung, I mainly worked on data pipeline construction and Airflow unit testing. I used Apache Airflow, Spark and AWS Athena to rebuild the query engine in a recommendation system for 100+ million SmartThings users. At the end of my co-op, I studied the Airflow API documentation and source code to work out how to build unit tests that shorten pipeline development time.

Presentation 4

Time: 12:00pm – 12:30pm

Speaker: Yixin Zhang, UBC Statistics MSc student

Title: Two Implementations of Large Language Models

Abstract: Large Language Models have captured widespread attention since the grand entrance of ChatGPT. During my co-op at Landsure Systems, I worked on two distinct projects that each applied large language models in some way. In the first project, we aimed to detect and highlight discriminatory covenants in approximately 110 million pages of land title contracts. We used pre-trained transformer models specialized in sentiment analysis to differentiate between discriminatory and non-discriminatory phrases. In the second project, we implemented retrieval-augmented generation on GPT-4 to create a chatbot for land-title-related queries. The resulting product was able to give more accurate and domain-specific answers with fewer hallucinations than traditional chatbots and base GPT models.