Lecture 5: Model choice

08 Mar 2015

Instructor: Alexandre Bouchard-Côté
Editor: TBA

Examples

  • Choice in the likelihood: should we replace the logit transform in the Challenger example by a probit transform?
  • Choice in the prior: in the Challenger example, should we replace the normal distribution by a t distribution?
  • Variable (covariate) selection: should we use only the temperature covariate, or both temperature and humidity, or only humidity, to predict o-ring failure?
  • Determining the number of clusters in mixture models.
    • Example: cancer heterogeneity and bulk sampling.
    • Issue: observed vs. population cluster count.

Notation

  • I: a discrete set of model indices.
  • Z_i for i ∈ I: the latent space for model i.
  • p_i, ℓ_i, m_i: the prior, likelihood, and marginal likelihood densities for model i.

Key idea

Put a prior p on I, and make the uncertainty over models part of the probabilistic model.

The new joint probability density is given by:

p((i, z), x) = p(i) p_i(z) ℓ_i(x|z),   (1)

where (i,z) is a member of a new latent space given by:

Z = ∪_{i ∈ I} ({i} × Z_i).   (2)

Notation: denote by M_i the event that model i is the model explaining the data.

Outcome: Using this construction, model choice can in principle be approached using the same methods as those used last week.

Example: (TODO)

  • Model selection with 0-1 loss.
  • Model averaging for prediction.
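A minimal numerical sketch of these two uses, once the construction above is in place. All model probabilities, marginal likelihoods, and per-model predictions below are made-up numbers for illustration:

```python
import math

# Hypothetical inputs: prior model probabilities p(i) and log marginal
# likelihoods log m_i(x) for three candidate models (illustrative numbers).
prior = {"M1": 1/3, "M2": 1/3, "M3": 1/3}
log_marginal = {"M1": -12.1, "M2": -10.4, "M3": -11.0}

# Posterior model probabilities: p(i|x) proportional to p(i) m_i(x),
# normalized in log space with the log-sum-exp trick for stability.
log_post = {i: math.log(prior[i]) + log_marginal[i] for i in prior}
m = max(log_post.values())
lse = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
post = {i: math.exp(v - lse) for i, v in log_post.items()}

# Model selection under 0-1 loss: report the posterior mode (MAP model).
map_model = max(post, key=post.get)

# Model averaging: weight each model's predictive mean by p(i|x)
# (the per-model predictions are again made-up numbers).
pred_mean = {"M1": 0.21, "M2": 0.35, "M3": 0.28}
averaged = sum(post[i] * pred_mean[i] for i in prior)

print(map_model, round(averaged, 4))
```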

Observations

Graphical modelling: Equation (2) cannot be directly expressed as a non-trivial graphical model (since it is not a product space). How to transform it into a graphical model?

Non-regularity: even with the reductions introduced so far, model selection deserves special attention because of non-regularities: the likelihood depends in a non-smooth way on the model indicator variable. In particular, different models can have latent spaces of different dimensionality. We will see that MCMC then requires special techniques, called trans-dimensional MCMC.

Bayes factor

Ratio of the marginal likelihood for two models:

B_12 = m_1(x) / m_2(x)

Values of B_12 greater than 1.0 favor model 1 over model 2; values smaller than 1.0 favor model 2 over model 1.

This is just a reparameterization of the Bayes estimator with an asymmetric 0-1 loss. Note that it is different from a likelihood ratio:

sup_{z_1} ℓ_1(x|z_1) / sup_{z_2} ℓ_2(x|z_2),

which does not arise within the Bayesian framework.
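The contrast between the two ratios can be seen on a classic toy example (not from the lecture; the data values are chosen for illustration): model 1 is a fair coin with no free parameter, model 2 puts a uniform prior on the heads probability. The Bayes factor penalizes model 2 for its extra flexibility, while the likelihood ratio always rewards it:

```python
import math

# Toy data: k heads in n coin flips (illustrative numbers).
n, k = 100, 60

# Model 1: theta fixed at 1/2 (no free parameters), so m_1 = l_1.
# Model 2: theta ~ Uniform(0, 1).
# Closed-form marginal likelihoods:
#   m_1(x) = C(n, k) (1/2)^n
#   m_2(x) = C(n, k) * Beta(k+1, n-k+1) = 1 / (n + 1)
log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
log_m1 = log_choose - n * math.log(2)
log_m2 = -math.log(n + 1)
bayes_factor = math.exp(log_m1 - log_m2)

# Likelihood ratio: maximize over theta in model 2 (theta_hat = k/n).
theta_hat = k / n
log_sup2 = (log_choose + k * math.log(theta_hat)
            + (n - k) * math.log(1 - theta_hat))
likelihood_ratio = math.exp(log_m1 - log_sup2)  # log_m1 is also log l_1 here

print(round(bayes_factor, 3), round(likelihood_ratio, 3))
```

Here the Bayes factor slightly favors the fair coin, while the likelihood ratio strongly favors the more flexible model.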

Computation of marginal likelihoods and Bayes factors

Conjugate models

Recall our notation:

  • h is a hyper-parameter for the prior, p_h(z).
  • Conjugacy means that the posterior density coincides with the prior density for updated hyper-parameters u(x, h): p_h(z|x) = p_{u(x,h)}(z).

Rearranging Bayes rule:

m(x) = p_h(z) ℓ(x|z) / p_h(z|x) = p_h(z) ℓ(x|z) / p_{u(x,h)}(z).

Since this identity holds for all z, we can pick an arbitrary z_0 and evaluate each factor on the right-hand side in closed form, using the conjugacy assumption.

Example: Poisson process on two regions.

Pro: exact.

Con: only possible for tractable conjugate families.
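The identity above can be checked numerically. The sketch below uses a Gamma-Poisson pair (a simpler conjugate stand-in for the two-region Poisson-process example; the hyper-parameters and data value are made up), and verifies that prior × likelihood / posterior gives the same value at every z_0, matching the closed-form marginal:

```python
import math

# Check m(x) = p_h(z) l(x|z) / p_{u(x,h)}(z) on a Gamma-Poisson pair.
# Hyper-parameters a, b and the data x are made-up illustrative values.
a, b = 2.0, 1.0  # prior: z ~ Gamma(a, b), rate parameterization
x = 4            # data: x ~ Poisson(z)

def log_gamma_pdf(z, shape, rate):
    return (shape * math.log(rate) + (shape - 1) * math.log(z)
            - rate * z - math.lgamma(shape))

def log_poisson_pmf(x, z):
    return x * math.log(z) - z - math.lgamma(x + 1)

# Conjugate update: u(x, h) maps (a, b) to (a + x, b + 1).
def log_marginal_at(z0):
    return (log_gamma_pdf(z0, a, b) + log_poisson_pmf(x, z0)
            - log_gamma_pdf(z0, a + x, b + 1))

# The value is the same for every z0 (checked at a few points) and
# matches the closed-form (negative binomial) marginal likelihood.
vals = [log_marginal_at(z0) for z0 in (0.5, 1.0, 3.0)]
log_m_exact = (math.lgamma(a + x) - math.lgamma(a) - math.lgamma(x + 1)
               + a * math.log(b) - (a + x) * math.log(b + 1))
print([round(v, 6) for v in vals], round(log_m_exact, 6))
```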

Model saturation

The idea is to build an augmented model which can be written as a graphical model (in contrast to Equation (2)), and from which we can still approximate m_i(x).

Construction of the auxiliary latent space:

  • Instead of defining the global latent space as a union of each model's latent space, define it as a product space,
  • and add to that an indicator μ that selects which model is used to explain the data. The event M_1 corresponds to μ = 1, and M_2 to μ = 2.

This creates the following auxiliary latent space:

Z̃ = {1, 2} × Z_1 × Z_2.

Example: dim(Z_1) = 1, dim(Z_2) = 2. What does Z̃ look like? Contrast with Z from Equation (2).

Construction of the auxiliary joint distribution: suppose the current state is (μ, z_1, z_2, x). We need to define an auxiliary joint density p̃(μ, z_1, z_2, x).

The idea is that when μ = 1, we explain the data x using z_1, and when μ = 2, we explain the data x using z_2.

In notation, if μ=1, we set:

p̃(μ, z_1, z_2, x) = p(μ) p_1(z_1) p_2(z_2) ℓ_1(x|z_1),

and if μ=2,

p̃(μ, z_1, z_2, x) = p(μ) p_1(z_1) p_2(z_2) ℓ_2(x|z_2).

Exercise: show that the marginal p̃(μ|x) can be used to obtain the Bayes factor:

(p̃(1|x) / p̃(2|x)) · (p(2) / p(1)) = B_12
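This identity can be checked numerically with a toy Gibbs sampler on the saturated space. The sketch below is not the lecture's own example: the two models, their hyper-parameters, and the data value are made up (two Gamma-Poisson models for a single count x), chosen so the exact Bayes factor is available in closed form for comparison:

```python
import math
import random

# Toy saturated-space sampler: two Gamma-Poisson models for one count x.
# All hyper-parameters and the data value are illustrative.
random.seed(0)
x = 4
hyper = {1: (2.0, 1.0), 2: (5.0, 1.0)}  # model i: z_i ~ Gamma(a_i, b_i)
prior_mu = {1: 0.5, 2: 0.5}             # p(mu)

def log_poisson(x, z):
    return x * math.log(z) - z - math.lgamma(x + 1)

# One Gibbs sweep on the product space {1,2} x Z1 x Z2:
#  - z_mu   | rest: Gamma(a + x, b + 1) posterior (model in use),
#  - z_other| rest: its prior (the likelihood ignores it),
#  - mu     | z1, z2, x: proportional to p(mu) * l_mu(x | z_mu).
def sweep(mu, z):
    for i in (1, 2):
        a, b = hyper[i]
        if i == mu:
            z[i] = random.gammavariate(a + x, 1.0 / (b + 1.0))
        else:
            z[i] = random.gammavariate(a, 1.0 / b)
    w = {i: prior_mu[i] * math.exp(log_poisson(x, z[i])) for i in (1, 2)}
    return 1 if random.random() < w[1] / (w[1] + w[2]) else 2

counts = {1: 0, 2: 0}
mu, z = 1, {1: 1.0, 2: 1.0}
for _ in range(50000):
    mu = sweep(mu, z)
    counts[mu] += 1

# Estimated (p~(1|x) / p~(2|x)) * (p(2) / p(1)) should approximate B12.
est_B12 = (counts[1] / counts[2]) * (prior_mu[2] / prior_mu[1])

def log_marginal(a, b, x):  # closed-form Gamma-Poisson marginal likelihood
    return (math.lgamma(a + x) - math.lgamma(a) - math.lgamma(x + 1)
            + a * math.log(b) - (a + x) * math.log(b + 1))

exact_B12 = math.exp(log_marginal(*hyper[1], x) - log_marginal(*hyper[2], x))
print(round(est_B12, 3), round(exact_B12, 3))
```

Note that the unused model's variable is refreshed from its prior at every sweep; this is what makes the saturated construction a valid (if potentially slow) product-space sampler.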

Pro: can use existing MCMC software such as JAGS or Stan without changing the sampler code.

Cons:

  • Limited to finite collections of models, |I| < ∞.
  • Can be slow.