Lecture 5: Model choice
Instructor: Alexandre Bouchard-Côté
Editor: TBA
Examples
- Choice in the likelihood: should we replace the logit transform in the Challenger example by a probit transformation?
- Choice in the prior: in the Challenger example, should we replace the normal prior by a t distribution?
- Variable (covariate) selection: should we use only the temperature covariate, or both temperature and humidity, or only humidity, to predict o-ring failure?
- Determining the number of clusters in mixture models.
- Example: cancer heterogeneity and bulk sampling.
- Issue: observed vs. population cluster count.
Notation
- $I$: an index over a discrete set of models.
- $\Zscr_i$ for $i\in I$: latent space for model $i$.
- $p_i$, $\ell_i$, $m_i$: prior, likelihood, and marginal likelihood densities for model $i$.
Key idea
Put a prior $p$ on $I$, and make the uncertainty over models part of the probabilistic model.
The new joint probability density is given by:
\begin{eqnarray} p((i, z), x) = p(i) p_i(z) \ell_i(x | z), \end{eqnarray}
where $(i, z)$ is a member of a new latent space given by:
\begin{eqnarray}\label{eq:new-latent-space} \Zscr = \bigcup_{i\in I} \left( \{i\} \times \Zscr_i \right). \end{eqnarray}
Notation: denote the event that model $i$ is the model explaining the data by $M_i$.
Outcome: Using this construction, model choice can in principle be approached using the same methods as those used last week.
Example: (TODO)
- Model selection with 0-1 loss.
- Model averaging for prediction.
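The worked example is marked TODO above; as a placeholder, here is a minimal runnable sketch of both decision rules using two hypothetical Beta-Binomial models (the Beta($1,1$) and Beta($10,10$) priors and the data are illustrative, not from the lecture):

```python
from math import lgamma, log, exp, comb

def log_beta(a, b):
    # log of the Beta function, via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(k, n, a, b):
    # Beta-Binomial marginal likelihood m_i(x) in closed form:
    # C(n, k) * B(k + a, n - k + b) / B(a, b)
    return log(comb(n, k)) + log_beta(k + a, n - k + b) - log_beta(a, b)

k, n = 7, 10                                  # data: k successes in n trials
models = {1: (1.0, 1.0), 2: (10.0, 10.0)}     # hypothetical Beta priors
log_m = {i: log_marginal(k, n, a, b) for i, (a, b) in models.items()}

# posterior model probabilities, with a uniform prior p(i) = 1/2 over models
total = sum(exp(v) for v in log_m.values())
post = {i: exp(v) / total for i, v in log_m.items()}

# model selection under 0-1 loss: report the posterior mode over models
selected = max(post, key=post.get)

# model averaging: predictive probability of a success on trial n + 1,
# mixing each model's posterior mean over the posterior model probabilities
pred = sum(post[i] * (k + a) / (n + a + b) for i, (a, b) in models.items())
```

The same posterior model probabilities drive both rules: 0-1 loss keeps only the mode, while averaging weights every model's prediction.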
Observations
Graphical modelling: the latent space in Equation (\ref{eq:new-latent-space}) cannot be directly expressed as a non-trivial graphical model, since it is not a product space. How can it be transformed into one?
Non-regularity: even with the reductions introduced so far, model selection warrants special attention because of non-regularities: the likelihood depends in a non-smooth way on the model indicator variable. Importantly, different models can have latent spaces of different dimensionality. We will see that MCMC then requires special techniques called trans-dimensional MCMC.
Bayes factor
Ratio of the marginal likelihood for two models:
\begin{eqnarray}\label{eq:bayes-factor} B_{12} = \frac{m_1(x)}{m_2(x)} \end{eqnarray}
Values of $B_{12}$ greater than 1.0 favor model #1 over #2. Values smaller than 1.0 favor #2 over #1.
This is just a reparameterization of the Bayes estimator with an asymmetric 0-1 loss. Note that it is different from a likelihood ratio:
\begin{eqnarray}\label{eq:likelihood-ratio} \frac{\sup_{z_1} \ell_1(x|z_1)}{\sup_{z_2} \ell_2(x|z_2)}, \end{eqnarray}
which does not arise within the Bayesian framework.
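To spell out the link with the Bayes estimator (a one-line derivation in the notation above): by Bayes rule, the posterior odds are the prior odds rescaled by the Bayes factor,
\begin{eqnarray} \frac{p(M_1 | x)}{p(M_2 | x)} = \frac{p(1)\, m_1(x)}{p(2)\, m_2(x)} = \frac{p(1)}{p(2)}\, B_{12}, \end{eqnarray}
so under a 0-1 loss the Bayes estimator selects $M_1$ exactly when $B_{12} > p(2)/p(1)$.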
Computation of marginal likelihoods and Bayes factors
Conjugate models
Recall our notation:
- $h$ is a hyper-parameter for the prior, $p_h(z)$.
- Conjugacy means that the posterior density stays in the same family as the prior, with updated hyper-parameters $u(x, h)$: $p_h(z|x) = p_{u(x, h)}(z)$.
Rearranging Bayes rule:
\begin{eqnarray} m(x) & = & \frac{p_{h}(z) \ell(x | z)}{p(z | x)} \\ & = & \frac{p_{h}(z) \ell(x | z)}{p_{u(x, h)}(z)}. \end{eqnarray}
Since this identity holds for all $z$, we can pick an arbitrary $z_0$ and evaluate each factor on the right-hand side, all of which are tractable by assumption.
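As a concrete sanity check of this trick (a sketch with an illustrative Beta-Binomial model, not necessarily the lecture's example): the ratio below equals $m(x)$ for any choice of $z_0$.

```python
from math import lgamma, log, comb

def log_beta(a, b):
    # log of the Beta function, via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_beta_pdf(z, a, b):
    # log density of a Beta(a, b) distribution at z
    return (a - 1) * log(z) + (b - 1) * log(1 - z) - log_beta(a, b)

def log_binom_lik(k, n, z):
    # log Binomial likelihood of k successes in n trials, success prob. z
    return log(comb(n, k)) + k * log(z) + (n - k) * log(1 - z)

a, b = 2.0, 3.0   # hyper-parameters h (illustrative values)
k, n = 7, 10      # observed data x
z0 = 0.3          # arbitrary evaluation point; any z0 in (0, 1) works

# m(x) = p_h(z0) * l(x | z0) / p_{u(x,h)}(z0),
# where conjugacy gives u(x, h) = (a + k, b + n - k)
log_m = (log_beta_pdf(z0, a, b) + log_binom_lik(k, n, z0)
         - log_beta_pdf(z0, a + k, b + n - k))
```

Comparing the result against the closed form $C(n,k)\, B(a+k, b+n-k)/B(a,b)$, or re-evaluating at a different $z_0$, confirms that the choice of $z_0$ is immaterial.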
Example: Poisson process on two regions.
Pro: exact.
Con: only possible for tractable conjugate families.
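The details of the two-region Poisson example are not written out above; one plausible instantiation (a hedged sketch, with made-up counts and Gamma$(1,1)$ hyper-parameters) compares a shared-rate model against a separate-rate-per-region model, with both marginal likelihoods available exactly by Gamma-Poisson conjugacy:

```python
from math import lgamma, log, exp

def log_gamma_poisson_marginal(xs, alpha, beta):
    # log m(x) for counts xs sharing one rate lam ~ Gamma(alpha, beta)
    # (rate parameterization): beta^alpha * Gamma(alpha + s)
    #   / (Gamma(alpha) * prod(x_i!) * (beta + n)^(alpha + s)), s = sum(xs)
    s, n = sum(xs), len(xs)
    return (lgamma(alpha + s) - lgamma(alpha)
            - sum(lgamma(x + 1) for x in xs)
            + alpha * log(beta) - (alpha + s) * log(beta + n))

alpha, beta = 1.0, 1.0   # illustrative hyper-parameters
x1, x2 = 12, 3           # made-up counts observed in the two regions

# Model 1: one shared rate; Model 2: an independent rate per region
log_m1 = log_gamma_poisson_marginal([x1, x2], alpha, beta)
log_m2 = (log_gamma_poisson_marginal([x1], alpha, beta)
          + log_gamma_poisson_marginal([x2], alpha, beta))
B12 = exp(log_m1 - log_m2)   # Bayes factor, shared vs. separate rates
```

Both marginals follow from the same rearranged-Bayes-rule identity, so the Bayes factor here is exact, no sampling required.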
Model saturation
The idea is to build an augmented model, which can be written as a graphical model (in contrast to Equation (\ref{eq:new-latent-space})), and from which we can still approximate $m_i(x)$.
Construction of the auxiliary latent space:
- Instead of defining the global latent space as a union of each model's latent space, define it as a product space,
- and add to that an indicator $\mu$ that selects which model to use to explain the data. The event $M_1$ corresponds to $\mu = 1$ and $M_2$, to $\mu = 2$.
This creates the following auxiliary latent space:
\begin{eqnarray} \Zscr' = \{1, 2\} \times \Zscr_1 \times \Zscr_2. \end{eqnarray}
Example: dim($\Zscr_1$) = 1, dim($\Zscr_2$) = 2. What is a picture for $\Zscr'$? Contrast with $\Zscr$ from Equation (\ref{eq:new-latent-space}).
Construction of the auxiliary joint distribution: suppose the current state is $(\mu, z_1, z_2, x)$. We need to define an auxiliary joint density $\tilde p(\mu, z_1, z_2, x)$.
The idea is that when $\mu = 1$, we explain the data $x$ using $z_1$, and when $\mu = 2$, we explain the data $x$ using $z_2$.
In notation, if $\mu = 1$, we set:
\begin{eqnarray} \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_1(x | z_1), \end{eqnarray}
and if $\mu = 2$,
\begin{eqnarray} \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_2(x | z_2). \end{eqnarray}
Exercise: show that the marginal $\tilde p(\mu | x)$ can be used to obtain the Bayes factor:
\begin{eqnarray} \frac{\tilde p(1 | x)}{\tilde p(2 | x)} \frac{p(2)}{p(1)} = B_{12} \end{eqnarray}
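A runnable illustration of this identity (a sketch with two hypothetical Beta-Binomial models standing in for $M_1$ and $M_2$; a hand-written Gibbs sampler on the saturated space replaces JAGS/Stan here):

```python
import random
from math import log, exp

random.seed(1)
k, n = 7, 10                                  # data: k successes in n trials
priors = {1: (1.0, 1.0), 2: (10.0, 10.0)}     # hypothetical Beta priors
p_mu = {1: 0.5, 2: 0.5}                       # prior on the model indicator

def loglik(z):
    # Binomial log-likelihood without the constant C(n, k), which cancels
    # when normalizing the full conditional of mu
    return k * log(z) + (n - k) * log(1 - z)

z = {1: 0.5, 2: 0.5}
counts = {1: 0, 2: 0}
for sweep in range(50_000):
    # Gibbs update of mu: tilde p(mu | z1, z2, x) prop. to p(mu) l_mu(x | z_mu)
    w1 = p_mu[1] * exp(loglik(z[1]))
    w2 = p_mu[2] * exp(loglik(z[2]))
    mu = 1 if random.random() < w1 / (w1 + w2) else 2
    # Gibbs update of z_i: posterior Beta if i == mu, prior Beta otherwise
    for i in (1, 2):
        a, b = priors[i]
        if i == mu:
            z[i] = random.betavariate(a + k, b + n - k)
        else:
            z[i] = random.betavariate(a, b)
    counts[mu] += 1

# estimated Bayes factor: [tilde p(1 | x) / tilde p(2 | x)] * [p(2) / p(1)]
B12_hat = (counts[1] / counts[2]) * (p_mu[2] / p_mu[1])
```

Since Beta-Binomial marginal likelihoods are available in closed form, the Monte Carlo estimate can be checked against the exact Bayes factor.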
Pro: can use existing MCMC methods such as JAGS/Stan without changing the MCMC code.
Cons:
- Limited to finite collections of models, $|I| < \infty$.
- Can be slow.