Graduate Students Seminar Series
2012 Jan. ~ Apr.
Thursdays 12-1pm
LSK 301
Yongliang (Vincent) Zhai
9 Feb, 2012
Probabilistic principal component analysis
As reviewed by Yumi last week, principal component analysis (PCA) is widely used in multivariate statistics for dimensionality reduction. However, a notable feature of the definition of PCA is the absence of an associated probabilistic model for the observed data.
In this talk, we will introduce the probabilistic principal component analysis proposed by Tipping and Bishop (1999) and a generalization of probabilistic principal component analysis to the exponential family, especially for binary data. We will also discuss supervised dimensionality reduction, where the objective is prediction rather than exploratory data analysis. A data analysis problem in population genetics is presented as an illustration.
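For reference, the latent variable model of Tipping and Bishop (1999) mentioned above can be written out as follows. The notation is an assumption of this sketch rather than something stated in the abstract: x is a d-dimensional observation, z is a q-dimensional latent variable with q < d, W is a d x q loading matrix, and lambda_1 >= ... >= lambda_d are the eigenvalues of the sample covariance matrix, with U_q holding the q leading eigenvectors, Lambda_q = diag(lambda_1, ..., lambda_q), and R an arbitrary q x q orthogonal matrix.

```latex
\begin{align}
x &= W z + \mu + \epsilon,
  \qquad z \sim \mathcal{N}(0, I_q),
  \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_d), \\
x &\sim \mathcal{N}\!\left(\mu,\; W W^{\top} + \sigma^2 I_d\right), \\
W_{\mathrm{ML}} &= U_q \left(\Lambda_q - \sigma^2_{\mathrm{ML}} I_q\right)^{1/2} R,
  \qquad
  \sigma^2_{\mathrm{ML}} = \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j .
\end{align}
```

The first line is the generative model, the second is the implied marginal distribution of the data, and the third gives the maximum likelihood estimates. The estimated noise variance is the average of the discarded eigenvalues, which is what ties the probabilistic model back to ordinary PCA.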
Yumi Kondo
2 Feb, 2012
Sparse Principal Component Analysis
Principal component analysis (PCA) has been widely used to understand complex multivariate datasets. PCA seeks uncorrelated linear combinations of the original variables, called principal components (PCs), whose variance is as large as possible. The coefficients of each linear combination, called loadings, give insight into the correlation structure hidden in the dataset. However, PCA has an obvious drawback: each PC is a linear combination of all the original variables, and the loadings are typically nonzero. This can make the PCs difficult to interpret. Hui Zou et al. [2006] proposed sparse principal component analysis (SPCA) to overcome this issue. This presentation reviews SPCA and then discusses the possibility of extending its idea to the functional space. To this end, I will first review traditional principal component analysis and regularization techniques in regression settings, such as the lasso, ridge regression and the elastic net.
Chen Xu
12 Jan, 2012
Penalized Likelihood Method for Variable Selection with Large Model Spaces
High-dimensional variable selection plays an important role in knowledge discovery and contemporary scientific research. Traditional selection procedures, such as best subset selection and stepwise regression, can be computationally expensive or unstable in the selection process. Instead, the penalized likelihood method (PLM) has drawn a great deal of attention. In this talk, I am going to give a brief review of PLM and introduce a sure-screening-based procedure for implementing PLM in ultra-high-dimensional settings. The effectiveness of the proposed method is supported by both theoretical verification and numerical studies.
Graduate Students Seminar Series
2011 Sep. ~ Dec.
Thursdays 12-1pm
LSK 301
24 Nov, 2011
Tianji Shi
Computer Experiments II:
Unlike statistical models, in which randomness or uncertainty is modeled using probability functions, computer (or numerical) models mathematically implement a series of "scientific formulas" to deliver deterministic outputs for given inputs. In reality, however, computer models are often complex and time-consuming to run, so model outputs at most input conditions are left unknown. In a paper titled "Design and Analysis of Computer Experiments", Sacks et al. (1989) developed a computationally cheap statistical model to emulate the outputs of complex computer models, which allowed for a variety of model analyses that were previously impossible.
In my last presentation, I talked about the aforementioned statistical model and its applications in ozone modelling. This talk is a follow-up, in which I'll be discussing the more complex case where the computer model being statistically emulated is a time-series process (dynamic model). An example of its application to agricultural crop yield will also be presented.
17 Nov, 2011
Tara (Yanling) Cai
Design a Design:
The second "Design" in the title refers to the kind of design you may be familiar with through courses in Experimental Design or through research. However, my talk will focus on the first "Design": carrying out a design. In summer 2011, I was fortunate to be involved in conducting an experiment at FPInnovations as the student project leader. Issues can easily arise at any stage of conducting an experiment. For example, our machine broke once during the experiment. How would you react or respond to such an emergency? While coping with such problems, you'll soon realize how important being organized is, which leads to the question "What should be organized?". In this student seminar, I'll talk about problems in a real experiment and provide personal comments and suggestions for managing a project.
10 Nov, 2011
Jing Dong
A Further Investigation of Multiple Sclerosis Relapse and Safety Monitoring Guidelines Based on MRI Lesion Activity
Riddell et al. (2011) evaluated a safety guideline commonly used by Data Safety Monitoring Boards (DSMBs) of multiple sclerosis (MS) clinical trials. The guideline flags patients who have an increase of five or more contrast enhancing lesions (CELs) on an MRI above the patient’s baseline level. In our working paper, we extend the results of Riddell et al. (2011). We assess the ability of modified CEL guidelines to predict impending relapse in Phase II trials. We apply the modified guidelines to a relapsing (R) cohort and a secondary progressive (SP) cohort. For each, we assess the value of the guideline in predicting relapse occurrence in the 28 days following an MRI.
3 Nov, 2011
Ardavan Saeedi
An Introduction to Nonparametric Bayesian models
27 Oct, 2011
Song Cai
Tricks of Using the Servers in our Department
In this talk, I will introduce some tricks for using our servers for mass simulations, including the following topics:
1-- How to check who is using the servers and what programs they are running, and how to check what jobs I am already running. Commands: w, who, top, ps, ps -u <username>, etc.
2-- How to submit multiple jobs on a server without using interactive interfaces like screen. For example, the following command is infinitely better than using screen in terms of running R jobs on servers: nohup nice -19 R --no-save < example.R > example_out &
3-- How to run, say, 100 R jobs automatically on a server by always keeping 4 jobs (the maximum # of jobs one is allowed to run simultaneously on one server) running until all 100 jobs are done. This is quite complicated, but you don't need to understand the details. A ready-to-use script written by me a few years ago can do this for you. I'll explain how to use it. In my opinion, this is a much better way of doing simple R "parallel" computing than the "snow" package in R.
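The idea in item 3 (keep a fixed number of jobs running until the whole batch is done) can be sketched in a few lines of shell. This is not the script referred to in the talk, which is not reproduced here. The function name run_jobs, its arguments, and the job_N.R file naming in the usage comment are all hypothetical. The sketch relies on the -P option of xargs, which is an extension found in GNU and BSD xargs rather than part of POSIX, and on the seq utility.

```shell
#!/bin/sh
# Minimal sketch under stated assumptions; run_jobs is a hypothetical helper,
# not the speaker's script.
#
# run_jobs MAX FIRST LAST COMMAND
#   Runs COMMAND once for every job number in FIRST..LAST, keeping at most
#   MAX copies running at the same time. Inside COMMAND the job number is
#   available as $1.
run_jobs() {
    max=$1
    first=$2
    last=$3
    cmd=$4
    # seq prints one job number per line; xargs starts a new "sh -c" for
    # each number as soon as one of the MAX running slots becomes free.
    seq "$first" "$last" |
        xargs -P "$max" -I '{}' sh -c "$cmd" job '{}'
}

# Example usage (assumes input files job_1.R ... job_100.R exist):
# run_jobs 4 1 100 'nice -n 19 R --no-save < "job_$1.R" > "job_$1.out" 2>&1'
```

To keep the batch alive after logging out, the script containing the call can itself be started with nohup and put in the background, in the same way as the single-job command in item 2.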
20 Oct, 2011
Pavel Krupskii
Factor copula models for multivariate data
The multivariate normality assumption is widely used for the analysis of multivariate data. The number of dependence parameters is then determined by the structure of the correlation matrix. For high-dimensional data, it is important to reduce the number of parameters in the model by imposing a special correlation structure. This can be done using factor models, in which one or more common normal factors define the whole dependence structure. In this talk, I will introduce extensions of these models for cases where the assumption of multivariate normality of the data is not valid.
13 Oct, 2011
James Proudfoot
Synoptic Climatology and Probabilistic Precipitation Downscaling Methods
Global climate models (GCMs) offer synoptic-scale weather data under different climate scenarios, but often the grid on which data is available is too sparse to be of real use. The goal of this talk is to introduce the field of climate downscaling and present a few precipitation downscaling techniques (both spatial and temporal), focusing on my work at Environment Canada with exponential dispersion models. Specifically, I'll be discussing some of the aspects of the Tweedie family of distributions which make them a straightforward choice for temporal downscaling with semi-continuous data, and some techniques for scoring different stochastically simulated weather series.
6 Oct, 2011
Davor Cubranic
Better research software through lightweight software engineering practices
This talk will introduce a handful of low-pain/high-payoff software engineering practices that will make developing software for your research easier and less error-prone. The focus will be on version control via Subversion and automated testing.
29 Sep, 2011
Ehsan Karim
Estimating the causal effect of a treatment while time-dependent confounding is present: An illustration of the sequential Cox Approach
22 Sep, 2011
Hongbin Zhang
Introduction of three field studies as a biostatistician
In this talk, I will give an overview of three field studies in which I was involved as a biostatistician. The studies are:
(1) knee osteoarthritis prevalence and prediction;
(2) surgical approach on spine tumor;
(3) direct impact of antenatal depression on postpartum parenting stress.
The background, research outline and objectives of each study will be briefly introduced; the main focus of this talk is the unpredictable statistical challenges and the corresponding practical solutions.
15 Sep, 2011
Eugene Barsky
Learn about your library to save your valuable time and increase your research impact in statistics
In this workshop, we are happy to invite Eugene Barsky, a science librarian and a liaison for our department, to talk about graduate publishing, journal impact factors, the h-index of authors, and the impact of journals and conferences in statistics. Moreover, several citation management tools (BibTeX, Mendeley and RefWorks) will be introduced and compared.
2011 Jan. ~ Apr.
Thursdays 12-1pm
LSK 301
13 Jan, 2011
Tianji Shi
Gaussian Process Model: Theory and Application
Unlike statistical models, in which randomness or uncertainty is modeled using probability functions, computer (or numerical) models mathematically implement a series of "scientific formulas" to deliver deterministic outputs for given inputs. In reality, however, computer models are often complex and time-consuming to run, so model outputs at most input conditions are left unknown. In a paper titled "Design and Analysis of Computer Experiments", Sacks et al. (1989) developed a computationally cheap statistical model to emulate the outputs of complex computer models, which allowed for a variety of model analyses that were previously impossible. In this presentation, I will discuss the aforementioned statistical model by briefly describing its principle, derivation and characteristics, along with an example of a real-world application.
20 Jan, 2011
Luke Bornn
Tools for the Aspiring Academic
I would like to discuss tools I've come across since starting my grad studies (such as cloud services and related tools) that help me in my research. I will discuss the following:
1) Dropbox. If I'd known about Dropbox and related tools when I'd started my grad studies, it would have saved me a lot of time and hassle. If you want to install it before Thursday (and bring your laptop along, if you'd like), here's a link: http://db.tt/YYYQq8c (this link gives you (disclaimer: and me!) an extra 250 MB of storage).
2) Google Docs. Another powerful cloud-based storage and document-editing solution.
3) Parallel (and GPU via CUDA) computing on the department's servers. I'll try to run a few examples to demonstrate the potential of parallelization here.
4) WestGrid
5) Time and project management software and techniques, with a focus on GTD. These kinds of tools become particularly important when handling a large number of concurrent projects.
6) arXiv, RSS, and journal subscriptions
7) Yet to be decided.
27 Jan, 2011
Tail-dependence measures: applications in finance
3 Feb, 2011
10 Feb, 2011
17 Feb, 2011
Reading Break!
24 Feb, 2011
Introduction to Survival Analysis
Survival analysis is concerned with modeling time-to-event data, often of humans but also of components and machines. Basic ideas and concepts in survival analysis will be presented. Two well-known families of models in survival analysis, the Cox proportional hazards models and the accelerated failure time models, will be introduced. Useful R functions will also be discussed with an example.
3 Mar, 2011
Aline Tabet
A Bayesian Approach for Discrete Choice Modeling
Discrete Choice Models are popular models in Econometrics and Marketing that are characterized by a discrete response. In this talk, I give a brief introduction to this class of models and define the objectives for inference. Furthermore, I present a brief overview of Bayesian methodology used to address these objectives, discussing the advantages and limitations. Finally, I introduce some of the methods I have been working on in my PhD thesis to handle some of these shortcomings.
10 Mar, 2011
Liangliang Wang
Bayesian Phylogenetic Inference using Sequential Monte Carlo Algorithms on posets
Inferring phylogenetic trees from large-scale molecular sequence data presents challenging statistical and computational problems. Standard Bayesian estimation of phylogenetic trees can handle rich evolutionary models, but requires expensive MCMC simulations. In this talk, I will show that Sequential Monte Carlo (SMC), a fast but previously less broadly applicable alternative to MCMC, can be extended to handle posterior inference over phylogenetic trees. The technique I used to generalize SMC depends on the existence of a general poset structure.
17 Mar, 2011
Causal Inference from Observational Data using Propensity Scores
24 Mar, 2011
Yitian (Sky) LIANG
Causal Inference, Self selection and Identification - A Follow-up Talk of Propensity Score Analysis
Triggered by the illuminating talk on propensity score analysis given by Ehsan last week, this presentation aims to give a general introduction to causal inference, based on some of my readings. In particular, some definitions of causal parameters will be presented. A key feature of causal inference, self-selection, will be illustrated and compared with another issue, "non-random sampling". At the end, different identification methods for causal parameters will be introduced, including propensity score analysis. Causal inference is a huge field in social science, epidemiology, etc. I don't aim to present the material at a deep level, as I myself am just a beginner.
31 Mar, 2011
Multiple Sclerosis (MS) is a serious neurological disease whose onset usually occurs in young adults. In recent years, repeated magnetic resonance imaging (MRI) scans have often been used to monitor MS. Given the follow-up time, designs with different scanning frequencies can be chosen. This talk will briefly introduce our approach to determining the most efficient design based on the count of new enhancing lesions captured by repeated MRI scans.
If you have any questions or comments regarding the "Graduate Students Seminar Series", please direct them to Liangliang Wang, Qian Ye (Monica) or Yang Liu.
