Lecture 5: Model choice
Instructor: Alexandre Bouchard-Côté
Editor: TBA
Examples
- Choice in the likelihood: should we replace the logit transform in the Challenger example by a probit transformation?
- Choice in the prior: in the Challenger example, should we replace the normal distribution by a t distribution?
- Variable (covariate) selection: should we use only the temperature covariate, or both temperature and humidity, or only humidity, to predict o-ring failure?
- Determining the number of clusters in mixture models.
- Example: cancer heterogeneity and bulk sampling.
- Issue: observed vs. population cluster count.
Notation
- I: an index over a discrete set of models.
- Zi for i∈I: latent space for model i.
- pi, ℓi, mi: prior, likelihood, and marginal likelihood densities for model i.
Key idea
Put a prior p on I, and make the uncertainty over models part of the probabilistic model.
The new joint probability density is given by:
p((i,z),x)=p(i)pi(z)ℓi(x|z),
where (i,z) is a member of a new latent space given by:
Z = ⋃i∈I ({i} × Zi).
Notation: denote the event that model i is the model explaining the data by Mi.
Outcome: Using this construction, model choice can in principle be approached using the same methods as those used last week.
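The construction above can be sketched by ancestral sampling: pick a model from p(i), a latent z from that model's prior pi(z), then data from ℓi(x|z). Below is a minimal sketch for a hypothetical two-model coin-flip setup (the Beta(1,1) and Beta(5,1) priors are illustrative choices, not from the lecture):

```python
import random

def sample_joint(n_flips=10, rng=random):
    # p(i): uniform prior over the two models
    i = 1 if rng.random() < 0.5 else 2
    # pi(z): each model has its own prior on the coin bias z
    z = rng.betavariate(1, 1) if i == 1 else rng.betavariate(5, 1)
    # li(x|z): both models share a Bernoulli likelihood in this sketch
    x = [1 if rng.random() < z else 0 for _ in range(n_flips)]
    return (i, z), x

(i, z), x = sample_joint()
```

Note that the sampled state (i, z) lives in the union space Z: only the latent variable of the selected model is instantiated.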
Example: (TODO)
- Model selection with 0-1 loss.
- Model averaging for prediction.
Observations
Graphical modelling: Equation (2) cannot be directly expressed as a non-trivial graphical model (since it is not a product space). How to transform it into a graphical model?
Non-regularity: even with the reductions introduced so far, model selection warrants special attention because of non-regularities: the likelihood depends in a non-smooth way on the model indicator variable. Importantly, different models can have latent spaces of different dimensionality. We will see that MCMC then requires special techniques called trans-dimensional MCMC.
Bayes factor
Ratio of the marginal likelihood for two models:
B12 = m1(x) / m2(x)
Values of B12 greater than 1.0 favor model #1 over #2. Values smaller than 1.0 favor #2 over #1.
This is just a reparameterization of the Bayes estimator with an asymmetric 0-1 loss. Note that it is different from a likelihood ratio:
sup_{z1} ℓ1(x|z1) / sup_{z2} ℓ2(x|z2),
which does not arise within the Bayesian framework.
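To make the contrast concrete, here is a small sketch (an illustrative Beta-Bernoulli example, not from the lecture): two models that differ only in their prior have a likelihood ratio of exactly 1, yet a non-trivial Bayes factor.

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(k, n, a, b):
    # Marginal likelihood of a fixed binary sequence with k successes
    # in n trials, under a Beta(a, b) prior on the success probability:
    # m(x) = B(a + k, b + n - k) / B(a, b).
    return log_beta(a + k, b + n - k) - log_beta(a, b)

k, n = 3, 10
B12 = exp(log_marginal(k, n, 1, 1) - log_marginal(k, n, 5, 1))
# B12 is about 7.8: the data favor model 1's uniform prior. The
# likelihood ratio, in contrast, is exactly 1 here, since both models
# maximize the same Bernoulli likelihood over z.
```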
Computation of marginal likelihoods and Bayes factors
Conjugate models
Recall our notation:
- h is a hyper-parameter for the prior, ph(z).
- Conjugacy means that the posterior density has the same form as the prior, with updated hyper-parameters u(x,h): ph(z|x)=pu(x,h)(z).
Rearranging Bayes rule:
m(x) = ph(z) ℓ(x|z) / ph(z|x) = ph(z) ℓ(x|z) / pu(x,h)(z).
Since this is true for all z, we can pick an arbitrary z0 and evaluate each factor on the right-hand side in closed form (each is available thanks to the conjugacy assumption).
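As a sanity check of this identity, consider an illustrative Beta-Bernoulli model (here h = (a, b) and u(x, h) = (a + k, b + n - k) for k successes in n trials); the dependence on z0 cancels:

```python
import math
from math import lgamma

def beta_logpdf(z, a, b):
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * math.log(z) + (b - 1) * math.log(1 - z))

def log_marginal(k, n, a, b, z0):
    # m(x) = ph(z0) l(x|z0) / pu(x,h)(z0), evaluated at an arbitrary z0
    log_prior = beta_logpdf(z0, a, b)
    log_lik = k * math.log(z0) + (n - k) * math.log(1 - z0)
    log_post = beta_logpdf(z0, a + k, b + n - k)  # conjugate posterior
    return log_prior + log_lik - log_post

# The same answer for every choice of z0:
m_a = log_marginal(3, 10, 1, 1, z0=0.3)
m_b = log_marginal(3, 10, 1, 1, z0=0.7)
```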
Example: Poisson process on two regions.
Pro: exact.
Con: only possible for tractable conjugate families.
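For a Poisson example in this spirit, here is a sketch under assumed Gamma(1, 1) priors (the counts and hyper-parameters are hypothetical): model 1 uses one shared rate for both regions, model 2 a separate rate per region, and the Gamma-Poisson marginal likelihood is available in closed form.

```python
from math import lgamma, log, exp

def log_marginal_poisson(counts, a=1.0, b=1.0):
    # Gamma-Poisson conjugacy: with rate lam ~ Gamma(a, b) and
    # counts ~ Poisson(lam) iid, the marginal likelihood is
    # m(x) = [b^a / Gamma(a)] * Gamma(a + S) / (b + n)^(a + S) / prod(x_i!)
    s, n = sum(counts), len(counts)
    return (a * log(b) - lgamma(a) + lgamma(a + s)
            - (a + s) * log(b + n)
            - sum(lgamma(c + 1) for c in counts))

y1, y2 = [12, 9, 11], [3, 2, 4]  # hypothetical counts in the two regions
log_m1 = log_marginal_poisson(y1 + y2)                        # shared rate
log_m2 = log_marginal_poisson(y1) + log_marginal_poisson(y2)  # separate rates
B12 = exp(log_m1 - log_m2)
# With such different counts across regions, B12 is well below 1:
# the data favor the two-rate model.
```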
Model saturation
The idea is to build an augmented model that can be written as a graphical model (in contrast to Equation (2)), and from which we can still approximate mi(x).
Construction of the auxiliary latent space:
- Instead of defining the global latent space as a union of each model's latent space, define it as a product space,
- and add to that an indicator μ that selects which model to use to explain the data. The event M1 corresponds to μ=1 and M2, to μ=2.
This creates the following auxiliary latent space:
Z′={1,2}×Z1×Z2.
Example: dim(Z1) = 1, dim(Z2) = 2. What is a picture for Z′? Contrast with Z from Equation (2).
Construction of the auxiliary joint distribution: the variables are now (μ,z1,z2,x). We need to define an auxiliary joint density ˜p(μ,z1,z2,x).
The idea is that when μ=1, we explain the data x using z1, and when μ=2, we explain the data x using z2.
In notation, if μ=1, we set:
˜p(μ,z1,z2,x)=p(μ)p1(z1)p2(z2)ℓ1(x|z1),
and if μ=2,
˜p(μ,z1,z2,x)=p(μ)p1(z1)p2(z2)ℓ2(x|z2).
Exercise: show that the marginal ˜p(μ|x) can be used to obtain the Bayes factor:
(˜p(1|x) / ˜p(2|x)) × (p(2) / p(1)) = B12.
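A minimal Gibbs sampler over the saturated space illustrates this (a hypothetical coin-flip setup with priors Beta(1,1) and Beta(5,1) and a uniform p(μ); these choices are illustrative, not from the lecture). Given μ, the selected z follows its posterior and the other z its prior; given z1 and z2, p(μ=i | ...) is proportional to p(i) ℓi(x|zi).

```python
import random
from math import exp, log

def log_lik(x, z):
    k = sum(x)
    return k * log(z) + (len(x) - k) * log(1 - z)

def gibbs(x, iters=20000, rng=random.Random(1)):
    k, n = sum(x), len(x)
    mu = 1
    count1 = 0
    for _ in range(iters):
        # z1 | mu, x: posterior Beta(1+k, 1+n-k) if mu == 1, else prior Beta(1,1)
        z1 = rng.betavariate(1 + (k if mu == 1 else 0),
                             1 + (n - k if mu == 1 else 0))
        # z2 | mu, x: posterior Beta(5+k, 1+n-k) if mu == 2, else prior Beta(5,1)
        z2 = rng.betavariate(5 + (k if mu == 2 else 0),
                             1 + (n - k if mu == 2 else 0))
        # mu | z1, z2, x: proportional to l_mu(x | z_mu) under the uniform p(mu)
        w1, w2 = exp(log_lik(x, z1)), exp(log_lik(x, z2))
        mu = 1 if rng.random() < w1 / (w1 + w2) else 2
        count1 += (mu == 1)
    p1 = count1 / iters             # Monte Carlo estimate of p~(1 | x)
    return p1 / (1 - p1)            # times p(2)/p(1) = 1 here: estimates B12

x = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 3 successes in 10 flips
B12_hat = gibbs(x)
```

The estimate is noisy but should land near the exact Bayes factor for this conjugate setup.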
Pro: existing MCMC tools such as JAGS can be used without changing the MCMC code (in Stan, the discrete indicator μ would have to be marginalized out by hand).
Cons:
- Limited to finite collections of models, |I|<∞.
- Can be slow.