Alexandre Bouchard-Côté
Data
Question: do the rabbits prefer one patch over the other?
Models:
\[ \begin{align*} \alpha &\sim \text{Exp}(0.1) \\ x_i | \alpha &\sim \text{Poi}(\alpha) \text{ for }i\in \{1,2\} \end{align*} \]
\[ \begin{align*} \beta_1 &\sim \text{Exp}(0.1) \\ \beta_2 &\sim \text{Exp}(0.1) \\ x_i | \beta_i &\sim \text{Poi}(\beta_i) \text{ for }i\in \{1,2\} \end{align*} \]
Note: this is a simple example of non-nested models (in the sense that we cannot set one of the parameters to zero to obtain the other one).
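For this particular pair of models the marginal likelihoods are actually available in closed form (the exponential prior combines with the Poisson likelihoods into a Gamma integral), which makes the example handy for checking the Monte Carlo methods discussed later. A sketch, using hypothetical counts since the rabbit data itself is not reproduced in these notes:

```python
import math

def log_m1(x1, x2, rate=0.1):
    # Model 1: shared alpha ~ Exp(rate), x_i | alpha ~ Poi(alpha), i = 1, 2.
    # m1 = rate * (x1+x2)! / (x1! * x2! * (rate+2)^(x1+x2+1))
    s = x1 + x2
    return (math.log(rate) + math.lgamma(s + 1)
            - math.lgamma(x1 + 1) - math.lgamma(x2 + 1)
            - (s + 1) * math.log(rate + 2))

def log_m2(x1, x2, rate=0.1):
    # Model 2: independent beta_i ~ Exp(rate), x_i | beta_i ~ Poi(beta_i).
    # Each factor integrates to rate / (rate+1)^(x_i+1).
    return sum(math.log(rate) - (x + 1) * math.log(rate + 1) for x in (x1, x2))

# hypothetical counts for illustration
print(log_m1(3, 10), log_m2(3, 10))
```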
\[ m_i(x) = \int p_i(z) \ell_i(x | z) {\text{d}}z \]
Notation warning: \(m\) is sometimes called \(Z\) in the Monte Carlo literature, but here \(Z\) will be used for the random variable corresponding to the latent variable \(z\).
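The simplest (and often highest-variance) estimator of this integral draws from the prior and averages the likelihood. A minimal Python sketch for model 1 of the rabbit example, using hypothetical counts:

```python
import math
import random

def mc_marginal(x1, x2, rate=0.1, n=100_000, seed=1):
    # m = E_prior[ likelihood ]: draw alpha ~ Exp(rate), then average
    # Poi(x1 | alpha) * Poi(x2 | alpha) over the draws.
    rng = random.Random(seed)
    norm = math.factorial(x1) * math.factorial(x2)
    total = 0.0
    for _ in range(n):
        alpha = rng.expovariate(rate)
        total += math.exp(-2.0 * alpha) * alpha ** (x1 + x2) / norm
    return total / n
```

For larger data sets the likelihood concentrates and almost all prior draws contribute essentially nothing, which is one motivation for the more sophisticated estimators (such as stepping stone) mentioned below.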
Synonyms used for “marginal likelihood”:
Examples: first two models in Model_selection.html
Put a prior \(p\) on \(I\), and make the uncertainty over models part of the probabilistic model (let us use uniform priors for today).
The new joint probability density is given by:
\[ p((i, z), x) = p(i) p_i(z) \ell_i(x | z), \]
where \((i, z)\) is a member of a new latent space given by:
\[ {\mathscr{Z}}= \bigcup_{i\in I} \left( \{i\} \times {\mathscr{Z}}_i \right), \]
Notation: denote the event that model \(i\) is the model explaining the data by \(M_i\).
Outcome: Using this construction, model choice can in principle be approached using the same methods as those used in earlier lectures.
Graphical modelling: \({\mathscr{Z}}\) cannot be directly expressed as a non-trivial graphical model (since it is not a product space). How to transform it into a graphical model?
Non-regularity: even with the reductions introduced so far, model selection warrants special attention because of non-regularities: the likelihood depends in a non-smooth way on the model indicator variable. Importantly, different models have latent spaces of different dimensionality. We will see that MCMC then requires special techniques called trans-dimensional MCMC.
Ratio of the marginal likelihood for two models:
\[ B_{12} = \frac{m_1(x)}{m_2(x)} \]
Values of \(B_{12}\) greater than 1.0 favor model #1 over #2; values smaller than 1.0 favor #2 over #1. See the Wikipedia article for some conventions on how to describe the strength of evidence in text: https://en.wikipedia.org/wiki/Bayes_factor.
This is just a reparameterization of the Bayes estimator with an asymmetric 0-1 loss. Note that it is different from a likelihood ratio:
\[ \frac{\sup_{z_1} \ell_1(x|z_1)}{\sup_{z_2} \ell_2(x|z_2)}, \]
which is not used within the Bayesian framework.
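To see the contrast concretely: since model 2's likelihood at \(\beta_1 = \beta_2 = \alpha\) coincides with model 1's, the maximized-likelihood ratio can never favor model 1 in this example, whereas the Bayes factor can. A Python sketch (hypothetical counts, assumes \(x_i > 0\)):

```python
import math

def poi_loglik(x, lam):
    # log Poisson mass: x log(lam) - lam - log(x!)
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def log_likelihood_ratio(x1, x2):
    # sup over model 1 (MLE: shared alpha = mean of the counts) minus
    # sup over model 2 (MLEs: beta_i = x_i); always <= 0 here.
    alpha_hat = (x1 + x2) / 2
    num = poi_loglik(x1, alpha_hat) + poi_loglik(x2, alpha_hat)
    den = poi_loglik(x1, x1) + poi_loglik(x2, x2)
    return num - den

print(log_likelihood_ratio(3, 10))  # negative: favors the richer model
print(log_likelihood_ratio(5, 5))   # zero: the two MLEs coincide
```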
Many approaches exist, each with pros and cons; see for example “19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology”
I will outline some key methods:
The idea is to build an augmented model, which can be written as a graphical model, and from which we can still approximate \(m_i(x)\).
Construction of the auxiliary latent space:
This creates the following auxiliary latent space:
\[ {\mathscr{Z}}' = \{1, 2\} \times {\mathscr{Z}}_1 \times {\mathscr{Z}}_2. \]
Example: dim(\({\mathscr{Z}}_1\)) = 1, dim(\({\mathscr{Z}}_2\)) = 2, for the rabbit hole problem
Construction of the auxiliary joint distribution: suppose the current state is \((\mu, z_1, z_2, x)\). We need to define an auxiliary joint density \(\tilde p(\mu, z_1, z_2, x)\).
The idea is that when \(\mu = 1\), we explain the data \(x\) using \(z_1\), and when \(\mu = 2\), we explain the data \(x\) using \(z_2\).
In notation, if \(\mu = 1\), we set:
\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_1(x | z_1), \]
and if \(\mu = 2\),
\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_2(x | z_2). \]
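To make the construction concrete, here is a small Python sketch (not part of the original notes) of a Gibbs sampler on the saturated space for the rabbit example, with hypothetical counts \(x = (3, 10)\) and Exp(0.1) priors. Writing `a` for model 1's rate and `(b1, b2)` for model 2's rates, the active block is refreshed from its conjugate Gamma full conditional, the inactive block from its prior, and \(\mu\) from \(p(\mu)\) times the likelihood of its block:

```python
import math
import random

def poi_loglik(x, lam):
    # log Poisson mass: x log(lam) - lam - log(x!)
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def saturated_gibbs(x1, x2, rate=0.1, n_iter=50_000, seed=2):
    # Gibbs on (mu, a, b1, b2): Exp(rate) priors are conjugate to the
    # Poisson likelihoods, so active blocks have Gamma full conditionals.
    rng = random.Random(seed)
    mu, count_mu1 = 1, 0
    for _ in range(n_iter):
        if mu == 1:
            a = rng.gammavariate(x1 + x2 + 1, 1.0 / (rate + 2))    # posterior
            b1, b2 = rng.expovariate(rate), rng.expovariate(rate)  # prior
        else:
            a = rng.expovariate(rate)                              # prior
            b1 = rng.gammavariate(x1 + 1, 1.0 / (rate + 1))        # posterior
            b2 = rng.gammavariate(x2 + 1, 1.0 / (rate + 1))        # posterior
        l1 = poi_loglik(x1, a) + poi_loglik(x2, a)
        l2 = poi_loglik(x1, b1) + poi_loglik(x2, b2)
        # p(mu = 1 | rest) under a uniform prior on mu (clamp avoids overflow)
        p1 = 1.0 / (1.0 + math.exp(min(l2 - l1, 700.0)))
        mu = 1 if rng.random() < p1 else 2
        count_mu1 += (mu == 1)
    return count_mu1 / n_iter
```

The returned fraction of sweeps with \(\mu = 1\) estimates \(\tilde p(\mu = 1 | x)\), which under a uniform prior equals \(m_1 / (m_1 + m_2)\).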
Example: third model in Model_selection.html
Recall:
\[m_i(x) = \int p_i(z) \ell_i(x | z) {\text{d}}z\]
Let us say \(\mu = 1\). Then:
\[ \begin{align*} \tilde p(\mu|x) &\propto \tilde p(\mu, x) \\ &= \int_{z_1} \int_{z_2} \tilde p(\mu, z_1, z_2, x) \,{\text{d}}z_2 \,{\text{d}}z_1 \\ &= \int_{z_1} \int_{z_2} p(\mu) p_1(z_1) p_2(z_2) \ell_1(x|z_1) \,{\text{d}}z_2 \,{\text{d}}z_1 \\ &= p(\mu) \left( \int_{z_1} p_1(z_1) \ell_1(x|z_1) \,{\text{d}}z_1 \right) \left( \int_{z_2} p_2(z_2) \,{\text{d}}z_2 \right) \\ &= p(\mu) \times m_1 \times 1 \end{align*} \]
Stepping stone: using the stepping stone method, we obtained \(\log m_1 \approx -11.94\) and \(\log m_2 \approx -8.98\).
Model saturation: since \(\mu\) is Bernoulli, you can approximate \(\tilde p(\mu|x)\) by looking at the posterior mean of the indicator variable in the third model in Model_selection.html.
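Plugging the stepping-stone estimates quoted above into this identity gives the posterior model probabilities under a uniform prior (a quick back-of-the-envelope conversion, not from the original notes):

```python
import math

lm1, lm2 = -11.94, -8.98   # stepping-stone estimates of log m1, log m2
log_B12 = lm1 - lm2        # log Bayes factor, about -2.96
# p(M1 | x) = m1 / (m1 + m2) under a uniform prior on models
p1 = 1.0 / (1.0 + math.exp(lm2 - lm1))
print(log_B12, p1)         # p1 is roughly 0.05: model 2 is favored
```

This should agree, up to Monte Carlo error, with the posterior mean of the indicator variable obtained via model saturation.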
Consider now the final exercise in Model_selection.html.
Pro: can use graphical-model-based MCMC software even when it does not provide marginal likelihood computation methods.
Cons:
Chapter 7 in Model Choice, except for section 7.3.
trials,arms,groupSizes,numbersOfCases
AZ-Oxford (combined),vaccinated,5807,30
AZ-Oxford (combined),control,5829,101
Pfizer-BioNTech,vaccinated,18198,8
Pfizer-BioNTech,control,18325,162
Moderna-NIH,vaccinated,14134,11
Moderna-NIH,control,14073,185
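As a quick sanity check on the data (not part of the notes), the naive plug-in efficacy \(1 - (\text{attack rate, vaccinated}) / (\text{attack rate, control})\) can be computed per trial; it roughly matches the headline figures reported for these vaccines (about 70%, 95%, 94%):

```python
# trial -> ((vaccinated size, cases), (control size, cases)), from the table above
rows = {
    "AZ-Oxford (combined)": ((5807, 30), (5829, 101)),
    "Pfizer-BioNTech":      ((18198, 8), (18325, 162)),
    "Moderna-NIH":          ((14134, 11), (14073, 185)),
}
for trial, ((nv, cv), (nc, cc)) in rows.items():
    efficacy = 1.0 - (cv / nv) / (cc / nc)  # naive point estimate, no uncertainty
    print(f"{trial}: {efficacy:.1%}")
```

The Bayesian model below goes beyond these point estimates by quantifying uncertainty and sharing strength across trials.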
~ Binomial(groupSize, incidence * (1.0 - efficacy))
where efficacy is 0.0 for control, and random for each vaccine.

--model.data data/vaccines/data.csv
model Vaccines {
  param GlobalDataSource data
  param Plate<String> trials
  param Plate<String> arms
  param Plated<IntVar> groupSizes
  random Plated<IntVar> numbersOfCases
  random Plated<RealVar> efficacies
  random Plated<RealVar> incidences
  random RealVar a ?: latentReal, b ?: latentReal, c ?: latentReal, d ?: latentReal
  laws {
    // fill this
  }
}