Many of the variables relevant to online advertising have heavy tails. Keywords range from very frequent to obscure. Advertisers span a great size range.
Host web sites range from very popular to rarely visited.
Much is known about the statistical properties of heavy tailed random variables. The Zipf distribution and Zipf-Mandelbrot distribution are frequently good approximations.
Much less attention has been paid to the joint distribution of two or more such quantities. In this work, we present a graphical display that shows the joint behavior of two long-tailed random variables. For ratings data (Netflix movies, Yahoo songs) we often see a strong head-to-tail affinity, where the major players of one type are over-represented among the minor players of the other. We look at several examples which reveal properties of the mechanism underlying the data. Then we present some mathematical models based on bipartite preferential attachment mechanisms and a Zipf-Poisson ensemble.
This is joint work with Justin Dyer.
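To make the setting concrete, here is a minimal numpy sketch (not the display from the talk) that simulates ratings whose two margins are Zipf-distributed and cross-tabulates "head" versus "tail" activity. Under the independent sampling used here the observed/expected ratios are near one, whereas the head-to-tail affinity described above would show up as enrichment of the off-diagonal cells. All sizes and thresholds are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a bipartite "ratings" data set whose two margins are heavy tailed:
# user activity and item popularity each follow a (truncated) Zipf law.
n_users, n_items, n_ratings = 2000, 500, 50000
user_p = 1.0 / np.arange(1, n_users + 1); user_p /= user_p.sum()
item_p = 1.0 / np.arange(1, n_items + 1); item_p /= item_p.sum()
users = rng.choice(n_users, size=n_ratings, p=user_p)
items = rng.choice(n_items, size=n_ratings, p=item_p)

# Cross-tabulate "head" (top 10% most frequent ranks) vs "tail" on each margin
# and compare the observed counts with what independence would predict.
user_head = users < n_users // 10
item_head = items < n_items // 10
table = np.array([[np.sum(user_head & item_head), np.sum(user_head & ~item_head)],
                  [np.sum(~user_head & item_head), np.sum(~user_head & ~item_head)]])
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
print("observed counts:\n", table)
print("observed / expected under independence:\n", table / expected)
```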
Meta-analysis is a statistical method that is used to combine the results of different studies in order to draw conclusions about a body of research.
For example, one might imagine extracting hazard ratios and odds ratios from a collection of different health research papers looking at the
effectiveness and safety of a drug (e.g. antidepressants). An emerging area of innovation in statistics involves meta-analysis of observational studies.
Unlike randomized controlled trials, which are the gold standard for proving causation, observational studies are prone to biases such as confounding
and measurement error. In this talk I will give an overview of meta-analysis of observational studies and draw parallels with sensitivity analysis
techniques and Bayesian analysis. I will motivate the discussion with the example of a meta-analysis of the relationship between oral
contraceptive use and endometriosis.
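As a reminder of the basic machinery, below is a small sketch of inverse-variance (fixed-effect) pooling together with Cochran's Q and the DerSimonian-Laird between-study variance, the usual starting point for a random-effects meta-analysis. The study estimates are invented for illustration and are unrelated to the oral contraceptive example.

```python
import numpy as np

# Hypothetical log odds ratios and standard errors extracted from k studies.
log_or = np.array([0.35, 0.10, 0.52, 0.28])
se = np.array([0.15, 0.20, 0.30, 0.12])

# Fixed-effect (inverse-variance) pooling.
w = 1.0 / se**2
pooled = np.sum(w * log_or) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))
ci = pooled + np.array([-1.96, 1.96]) * pooled_se
print("pooled OR:", np.exp(pooled), "95% CI:", np.exp(ci))

# Cochran's Q and the DerSimonian-Laird estimate of between-study variance,
# the usual starting point for a random-effects analysis.
Q = np.sum(w * (log_or - pooled) ** 2)
tau2 = max(0.0, (Q - (len(w) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
print("Q:", Q, "tau^2:", tau2)
```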
The generalized linear model (glm) (McCullagh and Nelder, 1989) is a popular technique for modelling a wide variety of data and assumes that the observations are independent and that the conditional distribution of the response given the covariates belongs to a canonical exponential family. Robust procedures for generalized linear models have been considered, among others, by Stefanski et al. (1986), Kunsch et al. (1989), Bianco and Yohai (1996), Cantoni and Ronchetti (2001), Croux and Haesbroeck (2003) and Bianco et al. (2005). Recently, robust tests for the regression parameter under a logistic model were considered by Bianco and Martinez (2009).
In practice, some response variables may be missing, by design (as in two-stage studies) or by happenstance. The methods described above are designed for complete data sets and problems arise when missing observations are present. In this talk, we focus our attention on those cases in which missing data occur only in the responses. This situation is frequent in opinion polls, socio-economic investigations, medical studies and other scientific experiments where the explanatory variables can be controlled. In these studies outliers can also be present and so, robust procedures need to be considered.
We consider robust estimators for the regression parameter of a generalized linear model in order to build test statistics for this parameter when missing data occur in the responses. When there are no missing data, these estimators include the family of estimators previously studied by several authors such as Bianco and Yohai (1996), Cantoni and Ronchetti (2001), Croux and Haesbroeck (2003) and Bianco et al. (2005). The robust estimators are asymptotically normally distributed, which allows us to construct robust testing procedures. The asymptotic distribution of the test statistic under contiguous alternatives is also obtained. The sensitivity of the procedures to single outliers will be studied through their influence function, while the finite sample properties of the proposed procedure are investigated through a Monte Carlo study where the robust test is also compared with nonrobust alternatives.
Bianco, A., Garcia Ben, M. and Yohai, V. (2005). Robust estimation for linear regression with asymmetric errors. Canad. J. Statist., 33, 511-528.
Bianco, A. and Martinez, E. (2009). Robust testing in the logistic regression model. Comp. Statist. Data Anal., 53, 4095-4105.
Bianco, A. and Yohai, V. (1996). Robust estimation in the logistic regression model. Lecture Notes in Statistics, 109, 17-34. Springer-Verlag, New York.
Cantoni, E. and Ronchetti, E. (2001). Robust inference for generalized linear models. J. Amer. Statist. Assoc., 96, 1022-1030.
Croux, C. and Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimator for logistic regression. Comp. Statist. Data Anal., 44, 273-295.
Kunsch, H., Stefanski, L. and Carroll, R. (1989). Conditionally unbiased bounded influence estimation in general regression models with applications to generalized linear models. J. Amer. Statist. Assoc., 84, 460-466.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. London: Chapman and Hall.
Stefanski, L., Carroll, R. and Ruppert, D. (1986). Bounded score functions for generalized linear models. Biometrika, 73, 413-424.
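To give a flavour of bounded-influence estimation in the setting of the abstract above, here is a rough sketch of a robust logistic regression in which each observation's contribution to the score is capped through a Huber psi applied to its Pearson residual. This follows the spirit, not the exact form, of the Cantoni-Ronchetti proposal (the Fisher-consistency correction term is omitted), and it does not address missing responses.

```python
import numpy as np

def huber_psi(r, c=1.345):
    return np.clip(r, -c, c)

def robust_logistic(X, y, c=1.345, n_iter=100, tol=1e-8):
    """Crude robust logistic regression: each observation's contribution to the
    score is bounded by applying a Huber psi to its Pearson residual.  This is
    in the spirit (not the exact form) of Cantoni and Ronchetti (2001); the
    Fisher-consistency correction term is omitted."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        v = mu * (1.0 - mu)
        r = (y - mu) / np.sqrt(v)                 # Pearson residuals
        score = X.T @ (huber_psi(r, c) * np.sqrt(v))
        info = X.T @ (v[:, None] * X)             # standard GLM information as a proxy
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data with a single gross outlier in the response.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * X[:, 1]))))
y[np.argmax(X[:, 1])] = 0                         # flip the most extreme observation
print("robust estimate:", robust_logistic(X, y))
```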
In a series of studies, we found that levels of lead below 10 micrograms per deciliter of whole blood, levels that are currently considered by the World Health Organization and Health Canada to be protective for children, were associated with diminished intellectual abilities in children. Indeed, the lowest levels of exposure were associated with the greatest decrements in IQ scores per unit of exposure. This presentation will review the evidence for a non-linear dose-response relationship of lead exposure with intellectual decrements and discuss the implications for policy.
I will talk about particle systems used to approximate conditional laws. I will present the classic system and something called the interacting Kalman filter. I will give some elements of the proof of why this last system performs well uniformly in time.
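For readers unfamiliar with the classic system, here is a minimal bootstrap particle filter for a one-dimensional linear-Gaussian state-space model (chosen so that an exact Kalman answer also exists). The interacting Kalman filter discussed in the talk is not shown here, and all model parameters are arbitrary and assumed known.

```python
import numpy as np

# A minimal bootstrap particle filter for a 1-d linear-Gaussian state-space model.
rng = np.random.default_rng(0)
T, N = 50, 1000                       # time steps, particles
a, sigma_x, sigma_y = 0.9, 1.0, 0.5

# Simulate a state trajectory and noisy observations.
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + sigma_x * rng.normal()
y = x + sigma_y * rng.normal(size=T)

# Particle filter: propagate, weight by the observation likelihood, resample.
particles = rng.normal(size=N)
est = np.zeros(T)
for t in range(T):
    if t > 0:
        particles = a * particles + sigma_x * rng.normal(size=N)
    logw = -0.5 * ((y[t] - particles) / sigma_y) ** 2
    w = np.exp(logw - logw.max()); w /= w.sum()
    est[t] = np.sum(w * particles)                       # filtering mean
    particles = particles[rng.choice(N, size=N, p=w)]    # multinomial resampling
print("last filtered means:", est[-5:], "true states:", x[-5:])
```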
"Celebrate Learning Week"
The goal of undergraduate instruction is to engender expert-like traits in the learners. Expert-like behaviour is described in general, along with how someone becomes expert in a given field. Suggestions are proposed as to which skills an undergraduate program in Statistics should promote and how instruction might best transform students into expert-like thinkers in the discipline.
Using a log link for binary response in generalized linear
mixed-effects models (GLMM) allows direct estimation of the relative
risk. If a random intercept is the only random effect in the
conditional mean structure, the marginal mean has the same
form. The fixed effects, representing the log relative risks, have the
same interpretation in both the mixed-effects model and the marginal
model. This leads to two approaches to estimating the relative risks: 1)
maximum likelihood for the mixed-effects models and 2) the generalized
estimating equations (GEE) approach for the marginal models.
In our study, we apply such log-linear models to assess the effects of
neutralizing antibodies to interferon beta-1b in relapsing-remitting
multiple sclerosis. The results obtained by the two approaches are
compared. The relative efficiency of the GEE approach and the
robustness of the GLMM approach to some forms of misspecification of
the model for the random effects are studied by simulations.
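A small simulation illustrates why the log link is convenient here: with a random intercept entering additively on the log scale, the cluster-specific and population-averaged relative risks coincide, so the empirical marginal relative risk recovers exp(beta1). The parameter values below are arbitrary and are not taken from the multiple sclerosis study.

```python
import numpy as np

# Clustered binary outcomes with a log link and a random intercept:
# P(Y=1 | x, b) = exp(beta0 + beta1*x + b).  Because the random intercept acts
# multiplicatively on the risk scale, the cluster-specific and the
# population-averaged relative risk are both exp(beta1).
rng = np.random.default_rng(2)
n_clusters, m = 5000, 10
beta0, beta1, sigma_b = -3.0, 0.4, 0.5
b = rng.normal(0.0, sigma_b, size=n_clusters)
x = rng.binomial(1, 0.5, size=(n_clusters, m))
p = np.exp(beta0 + beta1 * x + b[:, None])
p = np.minimum(p, 1.0)                 # the log link does not enforce p <= 1
y = rng.binomial(1, p)

# Empirical marginal relative risk, pooling over clusters.
rr = y[x == 1].mean() / y[x == 0].mean()
print("true exp(beta1):", np.exp(beta1), "empirical marginal RR:", rr)
```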
The major disadvantage with conventional spatial (Kriging) interpolation methodology is the fact that the claimed property of best linear unbiased prediction (BLUP) no longer holds when estimates of the spatial covariance parameters are plugged in. In my talk I report on recent work with my colleagues Gunter Spoeck and Hannes Kazianka in the area of Bayesian spatial prediction and design. The Bayesian approach not only offers more flexibility in modeling but also allows us to deal with uncertain covariance parameters, and it leads to more realistic estimates for the predicted variances.
We report on some experiences gained with our approach during a European project on "Automatic mapping of radioactivity in case of emergency". Moreover, I report on recent results on finding objective priors for the crucial nugget and range parameters of the widely used Matérn family of covariance functions. Finally, I will consider the problem of choosing an "optimal" spatial design, i.e. finding an optimal spatial configuration of the observation sites minimizing the mean squared error of prediction over an area of interest. Using Bessel-sine/cosine expansions for random fields we arrive at a design problem which is equivalent to finding optimal Bayes designs for linear regression models with uncorrelated errors, for which powerful methods and algorithms from convex optimization theory are available.
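For contrast with the Bayesian treatment, here is a sketch of plug-in simple kriging at a single location with an exponential (Matérn, smoothness 1/2) covariance whose parameters are treated as known; the reported kriging variance is exactly the quantity that becomes over-optimistic once those parameters are in fact estimated. All values are invented.

```python
import numpy as np

# Plug-in simple kriging at a single target location with an exponential
# covariance; the sill and range are treated as known.
rng = np.random.default_rng(3)
sites = rng.uniform(0, 10, size=(30, 2))
target = np.array([5.0, 5.0])
sigma2, range_par = 1.0, 2.0

def cov(d):
    return sigma2 * np.exp(-d / range_par)

D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
K = cov(D) + 1e-8 * np.eye(len(sites))            # covariance among observations
k = cov(np.linalg.norm(sites - target, axis=1))   # covariance to the target

z = rng.multivariate_normal(np.zeros(len(sites)), K)   # simulated zero-mean field
weights = np.linalg.solve(K, k)
pred = weights @ z
pred_var = sigma2 - k @ np.linalg.solve(K, k)
print("prediction:", pred, "plug-in kriging variance:", pred_var)
```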
A composite likelihood consists of a combination of valid likelihood objects. It is shown to be a good and practical alternative to the ordinary full likelihood when the full likelihood function is intractable, or difficult to evaluate due to complex dependencies. The resulting estimator enjoys desirable asymptotic properties such as consistency and asymptotic normality. In this talk we aim to compare the performance of composite likelihood estimation relative to estimation based on the full likelihood. Analytical and simulation results will be presented for different models. We will show that the composite likelihood approach is highly efficient, and that in a few important cases it is fully efficient, yielding estimators identical to those from the full likelihood.
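A minimal example of the idea: a pairwise (composite) likelihood for the common correlation of an exchangeable multivariate normal, obtained by summing bivariate normal log-likelihoods over all coordinate pairs instead of evaluating the full joint density. The dimension, sample size and true correlation below are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Pairwise (composite) likelihood for the common correlation rho of an
# exchangeable multivariate normal with unit variances.
rng = np.random.default_rng(4)
d, n, rho_true = 8, 500, 0.4
Sigma = np.full((d, d), rho_true) + (1.0 - rho_true) * np.eye(d)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

def neg_pairwise_loglik(rho):
    nll = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            x, y = X[:, i], X[:, j]
            q = (x**2 - 2.0 * rho * x * y + y**2) / (1.0 - rho**2)
            nll += np.sum(0.5 * q + 0.5 * np.log(1.0 - rho**2))
    return nll

res = minimize_scalar(neg_pairwise_loglik, bounds=(-0.9, 0.95), method="bounded")
print("pairwise-likelihood estimate of rho:", res.x)
```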
Natural resource problems typically must be modeled using data that are often incomplete, asynchronous and collected at different spatial and temporal scales with differing levels of measurement uncertainty. Both deterministic and stochastic models are widely applied in assessing environmental impacts, identifying risks and informing resource management decision-making for agricultural systems. However, existing models are high-dimensional, requiring extensive site-specific calibration, thereby limiting their spatial application. Likewise, simpler models must inevitably be combined to aid in more robust, integrative regional management or national policy-relevant decision-support. Using variable- and model-selection statistical techniques, one can identify models of intermediate complexity that can achieve appreciable reductions in parameter and structural uncertainty. In this way, such models may offer more reliable support to address a range of applications/problems and to identify critical thresholds and allocation trade-offs.
My talk will discuss several collaborative, inter-disciplinary projects that are investigating ways to improve the prediction and forecasting of crop production for food and energy in relation to water-use efficiency and climate variability across Canada.
I will highlight the use of wireless sensor monitoring network and satellite remote-sensing data. The talk will also showcase several national-scale, web-based decision-support systems currently in development. Here, the ability to refine and adapt models to take into account spatial and temporal-type operational constraints is of vital importance.
In many longitudinal studies, individual characteristics associated with their repeated measures may be covariates for the time to an event of interest. Thus, it is desirable to model both the survival process and the longitudinal process together. Statistical analysis may be complicated by missing data or measurement errors in the time-dependent covariates. This thesis considers a nonlinear mixed-effects model for the longitudinal process and the Cox proportional hazards model for the survival process. We provide a method based on the joint likelihood for nonignorable missing data, and we extend the method to the case of time-dependent covariates. We adapt a Monte Carlo EM algorithm to estimate the model parameters. We compare the method with the existing two-step method, with some interesting findings. A real example from a recent HIV study is used as an illustration.
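To fix notation (and setting aside the missing-data mechanism), the observed-data likelihood of a shared-random-effects joint model has the generic form

\[
L(\theta) \;=\; \prod_{i=1}^{n} \int f\bigl(y_i \mid b_i;\theta\bigr)\, f\bigl(t_i,\delta_i \mid b_i;\theta\bigr)\, f\bigl(b_i;\theta\bigr)\, db_i,
\]

where \(y_i\) collects the longitudinal measurements, \((t_i,\delta_i)\) is the possibly censored event time and its indicator, and \(b_i\) are the random effects linking the two submodels; the integral over \(b_i\) is what the Monte Carlo EM algorithm is used to handle.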
Nikoloulopoulos et al. compared several bivariate copulas, such as the t, BB1 and other copulas, fitted to GARCH(1,1)-filtered financial stock returns. They showed that the BB1 copula-GARCH model performed relatively better than the others in terms of likelihood fit and extreme quantile prediction. This project examines the concern that the estimates provided by such a parametric copula-GARCH model fitted by maximum likelihood are not reliable, and that the model therefore does not necessarily give a trustworthy measurement of tail dependence.
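As a quick sanity check of the kind such a comparison calls for, one can estimate the upper tail dependence nonparametrically from pseudo-observations and set it against the value implied by the fitted copula. Below is a numpy sketch on simulated bivariate-t stand-in data, not on GARCH-filtered returns.

```python
import numpy as np

# Nonparametric check on upper-tail dependence: estimate
# lambda_U(u) = P(V > u | U > u) from pseudo-observations (normalized ranks).
rng = np.random.default_rng(5)
n = 2000
g = rng.multivariate_normal(np.zeros(2), [[1.0, 0.6], [0.6, 1.0]], size=n)
w = rng.chisquare(4, size=n) / 4.0
z = g / np.sqrt(w)[:, None]                       # bivariate t, 4 df, rho = 0.6

u = (np.argsort(np.argsort(z[:, 0])) + 0.5) / n   # pseudo-observations
v = (np.argsort(np.argsort(z[:, 1])) + 0.5) / n
for q in (0.90, 0.95, 0.99):
    lam = np.mean(v[u > q] > q)
    print(f"empirical lambda_U({q}) = {lam:.3f}")
```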
Longitudinal studies often involve several statistical issues, in particular a longitudinal process and a time-to-event process whose association requires joint modeling for unbiased estimation. Computation for such joint models, for example via the EM algorithm, can be extremely intensive and can lead to convergence problems.
In this talk, we introduce an approximate likelihood-based inference method for jointly modeling the longitudinal process and the time-to-event process based on an NLME model and a parametric AFT model. By linearizing the joint model, we design a strategy for updating the random effects that connect the two processes, and propose two frameworks for different scenarios of the likelihood function. Both frameworks approximate the multidimensional integral in the observed-data joint likelihood by an analytic expression, which greatly reduces the computational intensity of the complex joint modeling problem. The new method looks promising in terms of both estimation results and computational efficiency, especially when the number of subjects is large.
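One standard analytic device for such integrals, not necessarily the one used in this talk, is the Laplace approximation; a toy one-dimensional version with a made-up integrand h is sketched below and compared with numerical quadrature.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# Laplace-type approximation of a one-dimensional integral of exp(h(b)),
# compared with numerical quadrature.  The integrand h is made up purely for
# illustration; it stands in for the random-effects integrand.
def h(b):
    return -0.5 * b**2 + np.log1p(np.exp(1.5 * b - 1.0)) - 2.0

b_hat = minimize_scalar(lambda b: -h(b)).x        # mode of h (h is concave here)
eps = 1e-5
h2 = (h(b_hat + eps) - 2.0 * h(b_hat) + h(b_hat - eps)) / eps**2   # numerical h''
laplace = np.exp(h(b_hat)) * np.sqrt(2.0 * np.pi / (-h2))
exact, _ = quad(lambda b: np.exp(h(b)), -np.inf, np.inf)
print("Laplace approximation:", laplace, "numerical quadrature:", exact)
```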
Likelihood based statistical inferences have been advocated by generations of statisticians. As an alternative to the traditional parametric likelihood, empirical likelihood is appealing for its nonparametric setting and desirable asymptotic properties. In this thesis, we first review and investigate the asymptotic and finite-sample properties of the empirical likelihood, particularly its implications for the construction of confidence regions for the population mean. We then focus on the properties of the adjusted empirical likelihood. The adjusted empirical likelihood was introduced to overcome the shortcomings of the empirical likelihood when it is applied to statistical models specified through general estimating equations. We discover several finite-sample properties of the adjusted empirical likelihood, mainly in its application to constructing confidence regions for the population mean. One important discovery is that the original adjusted empirical likelihood gives a bounded likelihood ratio statistic. This may cause problems when the sample size is not large enough or the nominal confidence level is too high. We propose a possible approach to modifying the adjusted empirical likelihood so as to obtain an unbounded likelihood ratio statistic.
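To fix ideas, here is a bare-bones computation of the (unadjusted) empirical likelihood ratio for a univariate mean, profiling out the Lagrange multiplier with a safeguarded Newton iteration; comparing the statistic with the chi-square(1) quantile gives the usual confidence region. The sample and candidate means are arbitrary.

```python
import numpy as np

def el_logratio(x, mu, n_iter=50):
    """-2 log empirical likelihood ratio for the mean of a univariate sample.
    The Lagrange multiplier lam solves sum_i z_i / (1 + lam * z_i) = 0 with
    z_i = x_i - mu; Newton steps are halved to keep all weights positive.
    A bare-bones sketch: mu is assumed to lie inside the range of the data."""
    z = x - mu
    lam = 0.0
    for _ in range(n_iter):
        d = 1.0 + lam * z
        grad = np.sum(z / d)
        hess = -np.sum(z**2 / d**2)
        step = grad / hess
        while np.any(1.0 + (lam - step) * z <= 0):   # keep weights positive
            step /= 2.0
        lam -= step
    return 2.0 * np.sum(np.log(1.0 + lam * z))

rng = np.random.default_rng(6)
x = rng.exponential(size=40)
# Compare with the chi-square(1) 95% critical value 3.84 to build a confidence set.
for mu in (0.8, 1.0, 1.2):
    print(mu, el_logratio(x, mu))
```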
In several experimental and observational studies it may happen that the number of observed variables is very much larger than the number of subjects. It can be proved that, for a given and fixed number of subjects, when the number of variables diverges and the noncentrality parameter of the underlying population distribution increases with each added variable, the power of combination-based permutation tests (Pesarin, F. and Salmaso, L., Permutation Tests for Complex Data: Theory, Applications and Software, Wiley) is monotonically increasing. When testing, e.g., for the equality of two distributions in a two-sample problem with treatment effects presumed to act possibly on more than one aspect, different tests may properly be considered for different features of the null hypothesis, leading to the multiple-aspect testing issue. Two different aspects may therefore be of interest: the location aspect, based on the comparison of location indexes, and the distributional aspect, based on the comparison of the empirical distribution functions. Combination-based tests allow the experimenter to carry out efficient multi-aspect testing even in the presence of mixed variables and missing data. Some application examples from biomedical observational studies, along with a demonstration of the standalone software NPC Test, will be discussed. In particular, these applications will cover, among others, repeated measures designs in ophthalmology and shape analysis. The main focus will be on repeated measures designs and longitudinal surveys with mixed variables and/or missing data.
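A minimal sketch of the nonparametric combination (NPC) idea for two aspects, using Fisher's combining function on permutation partial p-values; it is an illustration only, not the NPC Test software, and the data are simulated.

```python
import numpy as np

# Two-sample multi-aspect permutation test: one statistic for the location
# aspect (absolute mean difference) and one for the distributional aspect
# (two-sample Kolmogorov-Smirnov distance), combined with Fisher's combining
# function applied to permutation partial p-values.
rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 60)
y = rng.normal(0.4, 1.5, 60)
pooled = np.concatenate([x, y]); nx = len(x)

def aspects(a, b):
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.array([abs(a.mean() - b.mean()), np.max(np.abs(Fa - Fb))])

obs = aspects(x, y)
B = 2000
perm = np.array([aspects(*np.split(pooled[rng.permutation(len(pooled))], [nx]))
                 for _ in range(B)])

def fisher(p):
    return -2.0 * np.sum(np.log(p), axis=-1)

# Partial p-values for the observed statistics and for every permutation,
# then the combined (global) permutation p-value.
p_obs = (1 + np.sum(perm >= obs, axis=0)) / (B + 1)
ranks = np.argsort(np.argsort(perm, axis=0), axis=0)
p_perm = (B - ranks) / B
p_global = np.mean(fisher(p_perm) >= fisher(p_obs))
print("partial p-values:", p_obs, "combined NPC p-value:", p_global)
```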
We introduce the m-dependent approximation, an effective approximation method for a more general class of stationary processes. As applications, under easily verifiable and weaker conditions, we present limit theorems for the strong invariance principle, the maximum of the periodogram and spectral density estimation.
The pair-copula construction method can be used to build flexible multivariate distributions. This class includes drawable (D), canonical (C) and regular vines (see, for example, Kurowicka and Cooke (2006)). The multivariate distribution is built using only bivariate copulas, which can be identified as specific conditional and unconditional bivariate margins. This flexible class is very useful for applications in finance and allows for non-Gaussian dependence structures (see Aas et al. (2009), Czado (2009) and Min and Czado (2010)). I will discuss estimation and model selection methods and give applications to multivariate financial time series to illustrate the potential of these model classes.
1. Aas, K., Czado, C., Frigessi, A. and Bakken, H. (2009). Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics, 44, 182-198.
2. Czado, C. (2009). Pair-copula constructions of multivariate copulas. Preprint.
3. Kurowicka, D. and Cooke, R.M. (2006). Uncertainty Analysis with High Dimensional Dependence Modelling. Wiley & Sons, Chichester.
4. Min, A. and Czado, C. (2010). SCOMDY models based on pair-copula constructions with application to exchange rates. Preprint.
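To fix ideas, the three-dimensional pair-copula (D-vine) decomposition used in this literature (e.g. Aas et al., 2009) writes the joint density using only bivariate copula densities, typically under the simplifying assumption that the conditional pair copula does not depend directly on the conditioning value:

\[
f(x_1,x_2,x_3) \;=\; \Bigl[\prod_{k=1}^{3} f_k(x_k)\Bigr]\,
 c_{12}\bigl(F_1(x_1),F_2(x_2)\bigr)\,
 c_{23}\bigl(F_2(x_2),F_3(x_3)\bigr)\,
 c_{13\mid 2}\bigl(F_{1\mid 2}(x_1\mid x_2),\,F_{3\mid 2}(x_3\mid x_2)\bigr).
\]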
My talk will begin with a description of the `sinh-arcsinh'
transformation, which will then be used to define the sinh-arcsinh family
of distributions. When the base generating distribution is standard
normal, the `sinh-arcsinhed normal' (SASN) class of distributions is
obtained. This class contains symmetric as well as asymmetric members
and allows for tailweights that are heavier or lighter than those of the
normal distribution. As will be shown, the SASN class is highly tractable
and has many appealing properties. Likelihood based inference for it will
also be considered and applied in the analysis of real data. Finally,
the options used within the sinh-arcsinh formulation, as well as its
extension, will be discussed.
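A small sampler may help the reader experiment with the family. It uses the representation X = sinh((arcsinh(Z) + eps)/delta) with Z standard normal, which is my reading of the Jones-Pewsey construction; the parameter names and sign conventions are assumptions and may differ from those used in the talk.

```python
import numpy as np

# Sampling from a sinh-arcsinh normal (SASN) distribution via
# X = sinh((arcsinh(Z) + eps) / delta), with Z standard normal.  Here eps
# controls asymmetry and delta > 0 the tailweight (delta < 1 heavier than
# normal, delta > 1 lighter).  Parameterization assumed, not taken from the talk.
rng = np.random.default_rng(8)

def rsasn(n, eps=0.0, delta=1.0):
    z = rng.normal(size=n)
    return np.sinh((np.arcsinh(z) + eps) / delta)

for eps, delta in [(0.0, 1.0), (1.0, 1.0), (0.0, 0.5), (0.0, 2.0)]:
    x = rsasn(200_000, eps, delta)
    skew = ((x - x.mean()) ** 3).mean() / x.std() ** 3
    kurt = ((x - x.mean()) ** 4).mean() / x.std() ** 4
    print(f"eps={eps}, delta={delta}: skewness ~ {skew:.2f}, kurtosis ~ {kurt:.2f}")
```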
Three teams of graduate students from UBC participated in a case study poster competition at the SSC 2010 conference. Following a brief introduction of the case study, each team will discuss how they approached the problem, and their experience / hardships with the data analysis and poster creation. A brief description of the case study is included here:
Angiotensin I-converting enzyme (ACE) inhibitors are an important class of drugs, in use for nearly 30 years in the treatment of cardiovascular diseases such as hypertension and congestive heart failure. Despite being effective pharmaceutical agents, these drugs have side effects. These serious side effects have been attributed to bradykinin (a pro-inflammatory peptide and potent vasodilator), which has a short half-life and is rapidly inactivated in plasma by two exopeptidases, ACE and aminopeptidase P (APP). Bradykinin is also transformed by carboxypeptidase N (CPN) into the active metabolite des-Arg9-bradykinin (ARG), which in turn is inactivated by ACE and APP. Even though potentially deadly side effects have been attributed to bradykinin, there is no experimental evidence. Consequently, the primary objective of this case study is to characterize the activation metabolism of bradykinin and des-Arg9-bradykinin in plasma
and their role in angioedema.
(A full copy of the case study is available at http://www.ssc.ca/en/education/archived-case-studies/ssc-case-studies-2010-metabolism-of-bradykinin-and-endogenous)
One of the main challenges in today's data-rich environment is finding useful methods for data integration. I will show examples of analyses of data from the biological world where the difficulties arise from the heterogeneity of the data involved. I will show examples of combining trees, graphs or spatial data using distances and kernels.
With the generalization of sources of information that generate sustained high volumes of data, there has recently been a renewed interest in online (or recursive) estimation for various statistical models. In this talk, I consider a version of the Expectation-Maximization (EM) algorithm that can be used for online estimation in latent data models with independent observations. The general principle of the approach is to use a stochastic approximation scheme, in the domain of sufficient statistics, as a proxy for a limiting EM recursion. Depending on time and interest, I may also discuss the merit of this approach when used for batch estimation as well as ongoing work to extend the method to the case of hidden Markov models.
(Joint work with Eric Moulines)
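For concreteness, here is a toy online EM for a two-component Gaussian mixture with known unit variances, written in the spirit of the sufficient-statistic stochastic-approximation scheme described above; the mixture, step-size schedule and starting values are arbitrary choices, and this is not the exact recursion from the talk.

```python
import numpy as np

# Toy online EM: a stochastic-approximation recursion on the sufficient
# statistics of a two-component Gaussian mixture (known unit variances),
# followed by the usual M-step mapping back to parameters.
rng = np.random.default_rng(9)

def simulate(n):
    z = rng.random(n) < 0.3
    return np.where(z, rng.normal(-2.0, 1.0, n), rng.normal(1.5, 1.0, n))

w, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])
s0, s1 = w.copy(), w * mu                  # running sufficient statistics
for t, y in enumerate(simulate(50_000), start=1):
    dens = w * np.exp(-0.5 * (y - mu) ** 2)
    r = dens / dens.sum()                  # E-step for a single observation
    gamma = t ** -0.6                      # step-size schedule
    s0 = (1 - gamma) * s0 + gamma * r
    s1 = (1 - gamma) * s1 + gamma * r * y
    w, mu = s0, s1 / s0                    # M-step
print("weights:", w, "means:", mu)
```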
Evidence from randomized trials is considered mandatory for assessing the effects of drugs and other therapies, and there is a general consensus regarding the central concepts involved in a state-of-the-art randomized trial. Nevertheless, some confusion arises because interpretation of randomized trials requires both methodologic and substantive expertise. This interactive session uses examples from the clinical literature to discuss concepts that are central to the interpretation of randomized trials. Specifically, the session highlights how features of the randomized trial, such as randomization/stratification, blinding and the use of placebo or sham therapy, serve to ensure comparability of populations, effects and information.
We aim to study the response of the timings of phenological events, such as bud-bursting, blooming, and fruiting, to climate variables, especially daily average temperatures, and to predict future phenological events. The timing of a phenological event is a special type of time-to-event data, and daily average temperature is a time-varying covariate associated with it. Traditional models in survival analysis that are frequently used for dealing with time-dependent covariates are the Cox model and parametric proportional hazards models. However, these models encounter difficulties in our context. The Cox model is not efficient when there is an obvious trend in the covariates, and it is not generally suitable for prediction. At the same time, proportional hazards models involve complicated integration without a closed-form solution when complicated time-dependent covariates are present. Also, they usually require quite strong distributional assumptions. We developed a stochastic-process-based regression model for phenological data. Compared with the Cox model, this model is more efficient because it uses all the time-dependent covariate information, and it is suitable for making predictions. Compared with parametric proportional hazards models, the fitting of this model is computationally less demanding, and the model is less restrictive in its assumptions. With some extra mild assumptions, this model can easily be extended to incorporate sequential events as responses. It may also be useful for a broad range of survival data in medical studies. The application of our model to bloom date data from the Okanagan region of British Columbia shows that our model makes sense!
Consider a multi-center randomized clinical trial. Should the analysis be guided by the design of the trial? Most investigators would answer in the affirmative. Yet in practice the design and many important features of most trials are ignored in the analysis. Most analyses of clinical trials assume that the trial has a random sample of patients from some well defined population. This is the basic assumption of most statistical methods employed to analyze trials in which the inference is targeted at drawing conclusions from a well defined population. In truth there is no random sample of patients, nor is there a well defined population of patients. The patients in a trial can best be described as a “collection”, which is defined as the complement of a random sample. Conclusion: most randomized clinical trials are analyzed incorrectly, as the basic assumption of a random sample is not true. However, the basis of the inference can rely on the randomization process. Analytical techniques can be derived which depend only on the randomization process. The resulting inference will be a “local” inference in that it will only apply to the patients who have entered the study. Another basic tenet in any analysis is to take into account factors that affect the outcome. Many multi-center trials have large institutional variation. This is especially true in drug trials, where the institution’s patient management and support may influence the observed toxicity. However, efficiently accounting for institutional variation may be difficult, as many multi-center trials have large numbers of centers, each of which typically enters a small number of patients. In this lecture, methods will be described for making inferences which rely only on the randomization process but which also account for institutional variation. These methods have been adapted to account for the permuted blocks that are typically used to design the randomization allocation in many trials. The methods generally result in greater power when compared to statistical methods which ignore both institutional variation and permuted blocks. The methods have been adapted to group sequential trials.
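A toy illustration of randomization-based inference that respects permuted blocks: the reference distribution is generated by re-running the block randomization within centers rather than by appealing to a random sample. The outcome model, block size and test statistic below are arbitrary stand-ins, not the methods developed in the lecture.

```python
import numpy as np

# Randomization-based inference respecting permuted blocks within centers.
rng = np.random.default_rng(10)
n_centers, blocks_per_center, block_size = 12, 4, 4

def randomize():
    """One realization of 1:1 permuted-block randomization within each center."""
    return np.concatenate([rng.permutation([0, 0, 1, 1])
                           for _ in range(n_centers * blocks_per_center)])

center = np.repeat(np.arange(n_centers), blocks_per_center * block_size)
trt = randomize()
y = rng.normal(size=trt.size) + 0.3 * trt + rng.normal(size=n_centers)[center]

def center_means(v):
    return np.bincount(center, weights=v)[center] / np.bincount(center)[center]

def stat(t):
    resid = y - center_means(y)            # adjust outcomes for center effects
    return resid[t == 1].mean() - resid[t == 0].mean()

obs = stat(trt)
null = np.array([stat(randomize()) for _ in range(2000)])
print("randomization p-value:", np.mean(np.abs(null) >= np.abs(obs)))
```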
In the past decades, we have witnessed the revolution of information technology. Its impact on statistical research is enormous. This talk attempts to address recent developments and some potential research issues in Business, Industry and Government (BIG) Statistics, with special focus on computer experiments and information systems. An overall introduction and review will be given, followed by specific research potentials. For each subject, the problem will be introduced,
some initial results will be presented, and future research problems will be suggested. If time permits, I will also discuss some recent advances in Search Engine and RFID studies.
Slides of his talk can be downloaded at the website http://www.personal.psu.edu/users/j/x/jxz203/lin/Lin_pub/
Traditional approaches to statistical disease cluster detection focus on the identification of geographic areas with high numbers of incident or prevalent cases of disease. Events related to disease may be more appropriate for analysis in some contexts. I compare these approaches when the detection of aggregations of cases or events is conducted by testing individual administrative areas that may be combined with their nearest neighbours. The population and the cases (or events per case) for each area, as well as a nearest-neighbour spatial relationship, are required. I also investigate the power of the tests when implementing a testing algorithm. The methodology is illustrated on presentations to emergency departments.
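A rough sketch of the nearest-neighbour combination strategy (not the exact procedure of the talk): each area is tested together with its nearest neighbour against a Poisson expectation, and the smallest p-value is calibrated by Monte Carlo to account for testing every area. Populations, the rate and the neighbour map are invented, and the rate is treated as known for simplicity.

```python
import numpy as np
from scipy.stats import poisson

# Test each area combined with its nearest neighbour; adjust the minimum
# p-value by Monte Carlo simulation under the Poisson null.
rng = np.random.default_rng(14)
n_areas, rate = 40, 0.002
pop = rng.integers(1000, 10000, size=n_areas)
nn = (np.arange(n_areas) + 1) % n_areas            # stand-in nearest-neighbour map
cases = rng.poisson(rate * pop)

def min_p(counts):
    p = np.empty(n_areas)
    for i in range(n_areas):
        obs = counts[i] + counts[nn[i]]
        expect = rate * (pop[i] + pop[nn[i]])
        p[i] = poisson.sf(obs - 1, expect)         # P(X >= obs) under the null
    return p.min()

obs_min = min_p(cases)
null = np.array([min_p(rng.poisson(rate * pop)) for _ in range(999)])
print("Monte Carlo adjusted p-value:", (1 + np.sum(null <= obs_min)) / 1000)
```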
First I will present an overview of my experience as a co-op student in the Lethbridge Research Center at Agriculture and Agri-Food Canada. Then, I will focus on a particular project in which I was involved. This project consists of performing a sensitivity analysis on an ecosystem model. The aim of this analysis is to identify which inputs of the Biome-BGC ecosystem model explain the variability of two outputs, soil moisture and plant productivity, for spring wheat in North America. There are several methods that can be used to perform a sensitivity analysis. The analysis presented is done using Sobol’s method, which is implemented by the SIMLAB software.
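Sobol's method apportions output variance to the inputs; a minimal pick-and-freeze Monte Carlo estimator of the first-order indices is sketched below on the standard Ishigami test function (not on the Biome-BGC model, and without SIMLAB).

```python
import numpy as np

# Pick-and-freeze Monte Carlo estimation of Sobol first-order sensitivity
# indices on the Ishigami test function.
rng = np.random.default_rng(11)

def ishigami(x, a=7.0, b=0.1):
    return np.sin(x[:, 0]) + a * np.sin(x[:, 1]) ** 2 + b * x[:, 2] ** 4 * np.sin(x[:, 0])

n, d = 100_000, 3
A = rng.uniform(-np.pi, np.pi, size=(n, d))
B = rng.uniform(-np.pi, np.pi, size=(n, d))
fA, fB = ishigami(A), ishigami(B)
var = np.var(np.concatenate([fA, fB]))

for i in range(d):
    ABi = A.copy()
    ABi[:, i] = B[:, i]                    # replace the i-th input of A by that of B
    Si = np.mean(fB * (ishigami(ABi) - fA)) / var   # first-order index estimator
    print(f"S_{i + 1} ~ {Si:.3f}")
```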
Many applied statisticians are involved in research that will provide a body of literature necessary in the development of public policy. However, there are often certain pressures that distort this body of knowledge. One major issue is the lack of publication of studies that do not obtain statistically significant results, despite the fact that, through meta-analysis, they could help contribute to an overall statistically significant conclusion. It is also common that research sponsored by entities with a financial interest in achieving favourable results is more likely to report favourable results, and, similarly, research showing the opposite is often strongly attacked and discredited. In this seminar, I focus on a few examples pertaining to drug development, product regulation, and climate research, and on how this can affect the statisticians involved.
In this talk, I will share my co-op experiences working at St. Paul's Hospital and at Oxford Outcomes and talk about my decision to pursue the co-op option. This talk will also include a detailed look into three projects I was heavily involved in during my 8-month work term: 1. a meta-analysis of immunosuppressant therapies post-transplant, 2. an investigation into the hip fracture incidence rates in British Columbia over the past decade and 3. a comprehensive literature review on global hip fracture rates.
1: Department of Environmental and Occupational Health, School of Public Health, Drexel University, Philadelphia, PA, USA
2: Community and Occupational Medicine Program, Department of Medicine, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Alberta, Canada
Background: The autism spectrum disorders (ASD) are a group of rare impairments of neurodevelopment that manifest prior to 3 years of age and are associated with impaired verbal and non-verbal communication and social interaction, and restricted and repetitive patterns of behaviour. ASD reduces quality of life in affected children and their parents, and leads to extraordinary economic costs for society. A recent review limited to only high-quality articles, on the basis of circumstantial epidemiological evidence, advanced a hypothesis that foetal hypoxia is implicated in ASD, but this hypothesis was never tested directly. There is some data to suggest that foetal hypoxia is likely to affect boys, but not girls.
Study design: Provincial delivery records (PDR) identified the cohort of 218,890 singleton live births in the province of Alberta, Canada, between 01/01/98 and 31/12/04. These were followed up for ASD via ICD-9 diagnostic codes assigned by physician billing until 31/03/08. Maternal and obstetric risk factors, as well as measures of foetal hypoxia, were extracted from PDR.
Statistical challenges:
- Estimates of the prevalence of ASD varied from 3/1000 (two services by any combination of psychiatrist or paediatrician) to 5.2/1000 (one claim by any physician). The actual time of onset of ASD is unknown, but can precede diagnosis by quite some time. Therefore, outcome misclassification is likely and there is no gold standard, such as records of assessments from specialized clinics, although we can guess sensitivity and specificity from a similar Canadian study.
- Foetal hypoxia (exposure) was measured using 3 different tests, and not all 3 tests were performed on all subjects tested. These tests are measured on a continuous scale and are dichotomized on the basis of clinical guidelines. Therefore, we have a measurement error problem aggravated by dichotomization of a mis-measured variable, which can produce non-ignorable differential exposure misclassification.
- For half of the subjects, the test of hypoxia was not performed (deemed to be very unlikely to be positive?). Therefore, there is severe missingness that is likely to fail the missing-at-random (MAR) assumption.
- ASD is a very rare outcome, leading to zero-inflation.
Some results: We ignored the other complications (outcome misclassification, exposure measurement error and zero-inflation) but applied the Expectation-Maximization (EM) algorithm to the missingness problem, modeling the probability of exposure among missing values using suspected covariates of foetal hypoxia such as low Apgar score, C-section, low birth weight, etc. A simple correction for deviation from the MAR assumption was attempted in a sensitivity analysis. Compared to complete-case analysis, the EM algorithm resulted in a gain of precision and a borderline “significant” effect in the expected direction among boys. Further adjustment for even a small deviation from the MAR assumption dramatically alters inference (if one follows the traditions of the biomedical literature) about the effect of foetal hypoxia on ASD risk among full-term boys, supporting the a priori hypothesis. Apparently an important result, but can it be trusted if we have been naïve about the uncertainty associated with ignoring all the other challenges posed by the data?
Asymptotic independence of the components of random vectors is a concept used in many applications. The standard criteria for checking asymptotic independence are given in terms of distribution functions (dfs). Dfs are rarely available in explicit form, especially in the multivariate case. Often the form of the density is given or, via the shape of the data clouds, one can obtain a good geometric image of the asymptotic shape of the level sets of the density. In the talk, a simple sufficient condition for asymptotic independence, stated in terms of this asymptotic shape for light-tailed densities, will be presented. This condition extends Sibuya's classic result on asymptotic independence for Gaussian densities.
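For reference, upper-tail asymptotic independence of a pair \((X_1,X_2)\) with marginal dfs \(F_1,F_2\) can be written as

\[
\lambda_U \;=\; \lim_{u\to 1^-} \Pr\bigl(F_1(X_1) > u \,\big|\, F_2(X_2) > u\bigr) \;=\; 0,
\]

and Sibuya's classic result is that bivariate Gaussian vectors with correlation strictly less than one satisfy this, no matter how strong the correlation.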
Both linguistics and biology face scientific questions that require reconstructing the ancestral forms of discrete sequences from their modern descendants. In linguistics, these questions are about the words that appeared in the protolanguages from which modern languages evolved. Linguists painstakingly reconstruct these words by hand using knowledge of the relationships between languages and the plausibility of sound changes. In biology, analogous questions concern the DNA, RNA, or protein sequences of ancestral organisms. By reconstructing ancestral sequences and the evolutionary paths between them, biologists can make inferences about the evolution of gene function and the nature of the environment in which they evolved.
In this talk, I will give an overview of the main challenges in the field, and show how we addressed two critical difficulties encountered in previous approaches. The first difficulty comes from the need to fit rate matrices and birth-death parameters of Continuous Time Markov Chains (CTMCs), and to obtain marginals from these CTMCs for different branch lengths. While these operations can be easily done in pure substitution models, the equivalent task in all but the simplest InDel models is highly non-trivial. The second difficulty comes from the need to evaluate partition functions and take expectations over the exceedingly large space of evolutionary derivations.
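In the pure substitution case the "easy" operations mentioned above amount to matrix exponentials of a rate matrix; a small sketch (with a made-up 4-state rate matrix, not a fitted one) shows how marginals at a descendant are obtained for different branch lengths.

```python
import numpy as np
from scipy.linalg import expm

# For a pure substitution model, transition probabilities over a branch of
# length t are P(t) = expm(Q t), and the marginal at a descendant follows by a
# matrix-vector product.  The 4-state rate matrix below is made up, not fitted.
Q = np.array([[-0.9, 0.3, 0.4, 0.2],
              [0.2, -0.8, 0.3, 0.3],
              [0.4, 0.2, -0.9, 0.3],
              [0.1, 0.4, 0.3, -0.8]])
pi0 = np.array([0.25, 0.25, 0.25, 0.25])   # distribution at the ancestor

for t in (0.1, 1.0, 10.0):
    P = expm(Q * t)
    print(f"t={t}: marginal at the descendant =", np.round(pi0 @ P, 3))
```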
I will also present an application to gappy multiple sequence alignment, and a new characterization of sound change obtained from the model.
We present a penalized matrix decomposition, a new framework for computing a low-rank approximation for a matrix. This low-rank approximation is a generalization of the singular value decomposition. While the singular value decomposition usually yields singular vectors that have no elements that are exactly equal to zero, our new decomposition results in sparse singular vectors. When this decomposition is applied to a data matrix, it can yield interpretable results. Moreover, when applied to a dissimilarity matrix, this leads to a method for sparse hierarchical clustering, which allows for the clustering of a set of observations using an adaptively-chosen subset of the features. These methods are demonstrated on the Netflix data and on a genomic data set.
This is joint work with Robert Tibshirani and Trevor Hastie.
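A stripped-down rank-one version of the idea can be written as an alternating soft-thresholded power iteration. The sketch below uses fixed thresholds rather than thresholds chosen to satisfy explicit L1 constraints as in the paper, so it should be read as an illustration of why sparse singular vectors emerge, not as the authors' algorithm.

```python
import numpy as np

def soft(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def rank1_pmd(X, lam_u=0.0, lam_v=1.5, n_iter=200):
    """Rank-one penalized matrix decomposition by alternating soft-thresholded
    power iterations.  Simplified sketch: the thresholds are fixed instead of
    being chosen to satisfy explicit L1-norm constraints as in the paper."""
    v = np.linalg.svd(X, full_matrices=False)[2][0]
    for _ in range(n_iter):
        u = soft(X @ v, lam_u)
        u /= np.linalg.norm(u) + 1e-12
        v = soft(X.T @ u, lam_v)
        v /= np.linalg.norm(v) + 1e-12
    return u @ X @ v, u, v

rng = np.random.default_rng(12)
u0 = rng.normal(size=50)
v0 = np.zeros(20); v0[:5] = [2.0, -1.5, 1.0, 1.0, -2.0]
X = np.outer(u0, v0) + rng.normal(scale=0.5, size=(50, 20))   # noisy sparse rank-1
d, u, v = rank1_pmd(X)
print("nonzero entries of the sparse right factor:", np.nonzero(v)[0])
```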
Many modern statistical applications involve noisy observations of an underlying process that can best
be described by a complex deterministic system. In fields such as astronomy, astrophysics and the
environmental sciences, these systems often involve the solution of partial differential equations that
represent the best available understanding of the physical processes. Statistical computation in this
context is typically hampered by either look-up tables or expensive “black-box” function evaluations.
We present an example from astrophysics with a “look-up table likelihood”: the analysis of stellar
populations. Astrophysicists have developed sophisticated models describing how intrinsic physical
properties of stars relate to observed photometric data. The mapping between the parameters and the
data-space cannot be solved analytically and is represented as a series of look-up tables. We present a
flexible hierarchical model for analyzing stellar populations. Our computational framework is
applicable to many "black-box" settings, and robust to the structure of the black-box. The performance
of various sampling schemes will be presented, together with the results for an Astronomical dataset.
This is joint work with Xiao-Li Meng, Andreas Zezas and Vinay Kashyap.
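To illustrate what a "look-up table likelihood" means computationally, here is a toy example in which the log-likelihood is available only on a parameter grid (pre-computed from a known function that stands in for the stellar-population tables) and a random-walk Metropolis sampler evaluates it by interpolation. The grid, target and step size are all invented for illustration.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Toy "look-up table likelihood": a tabulated log-likelihood queried by
# interpolation inside a random-walk Metropolis sampler.
grid = np.linspace(-3, 3, 61)
G1, G2 = np.meshgrid(grid, grid, indexing="ij")
table = -0.5 * (G1**2 + G2**2 + 0.8 * G1 * G2)        # tabulated log-likelihood
loglik = RegularGridInterpolator((grid, grid), table)

rng = np.random.default_rng(13)
theta, samples = np.zeros(2), []
for _ in range(20_000):
    prop = theta + 0.5 * rng.normal(size=2)
    if np.all(np.abs(prop) < 3):                      # stay on the table (flat prior)
        if np.log(rng.random()) < (loglik(prop) - loglik(theta)).item():
            theta = prop
    samples.append(theta)
samples = np.array(samples)
print("posterior mean:", samples.mean(axis=0))
print("posterior covariance:\n", np.cov(samples.T))
```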