Alexandre Bouchard-Côté
Data
Question: do the rabbits prefer one patch over the other?
Models:
\[ \begin{align*} \alpha &\sim \text{Exp}(0.1) \\ x_i | \alpha &\sim \text{Poi}(\alpha) \text{ for }i\in \{1,2\} \end{align*} \]
\[ \begin{align*} \beta_1 &\sim \text{Exp}(0.1) \\ \beta_2 &\sim \text{Exp}(0.1) \\ x_i | \beta_i &\sim \text{Poi}(\beta_i) \text{ for }i\in \{1,2\} \end{align*} \]
Note: this is a simple example of non-nested models (in the sense that we cannot set one of the parameters to zero to obtain the other one).
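For this particular pair of models the marginal likelihoods are actually available in closed form (the exponential prior combines with the Poisson likelihoods into a Gamma integral), which makes the example handy for checking the Monte Carlo methods discussed later. A sketch, using hypothetical counts since the rabbit data itself is not reproduced in these notes:

```python
import math

def log_m1(x1, x2, rate=0.1):
    # Model 1: shared alpha ~ Exp(rate), x_i | alpha ~ Poi(alpha), i = 1, 2.
    # m1 = rate * (x1+x2)! / (x1! * x2! * (rate+2)^(x1+x2+1))
    s = x1 + x2
    return (math.log(rate) + math.lgamma(s + 1)
            - math.lgamma(x1 + 1) - math.lgamma(x2 + 1)
            - (s + 1) * math.log(rate + 2))

def log_m2(x1, x2, rate=0.1):
    # Model 2: independent beta_i ~ Exp(rate), x_i | beta_i ~ Poi(beta_i).
    # Each factor integrates to rate / (rate+1)^(x_i+1).
    return sum(math.log(rate) - (x + 1) * math.log(rate + 1) for x in (x1, x2))

# hypothetical counts for illustration
print(log_m1(3, 10), log_m2(3, 10))
```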
\[ m_i(x) = \int p_i(z) \ell_i(x | z) {\text{d}}z \]
Notation warning: \(m\) is sometimes called \(Z\) in the Monte Carlo literature, but here \(Z\) will be used for the random variable corresponding to the latent variable \(z\).
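The simplest (and often highest-variance) estimator of this integral draws from the prior and averages the likelihood. A minimal Python sketch for model 1 of the rabbit example, using hypothetical counts:

```python
import math
import random

def mc_marginal(x1, x2, rate=0.1, n=100_000, seed=1):
    # m = E_prior[ likelihood ]: draw alpha ~ Exp(rate), then average
    # Poi(x1 | alpha) * Poi(x2 | alpha) over the draws.
    rng = random.Random(seed)
    norm = math.factorial(x1) * math.factorial(x2)
    total = 0.0
    for _ in range(n):
        alpha = rng.expovariate(rate)
        total += math.exp(-2.0 * alpha) * alpha ** (x1 + x2) / norm
    return total / n
```

For larger data sets the likelihood concentrates and almost all prior draws contribute essentially nothing, which is one motivation for the more sophisticated estimators (such as stepping stone) mentioned below.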
Synonyms used for “marginal likelihood”:
Examples: first two models in Model_selection.html
Put a prior \(p\) on \(I\), and make the uncertainty over models part of the probabilistic model (let us use uniform priors for today).
The new joint probability density is given by:
\[ p((i, z), x) = p(i) p_i(z) \ell_i(x | z), \]
where \((i, z)\) is a member of a new latent space given by:
\[ {\mathscr{Z}}= \bigcup_{i\in I} \left( \{i\} \times {\mathscr{Z}}_i \right), \]
Notation: denote the event that model \(i\) is the model explaining the data by \(M_i\).
Outcome: Using this construction, model choice can in principle be approached using the same methods as those used in earlier lectures.
Graphical modelling: \({\mathscr{Z}}\) cannot be directly expressed as a non-trivial graphical model (since it is not a product space). How to transform it into a graphical model?
Non-regularity: even with the reductions introduced so far, model selection warrants special attention because of non-regularities: the likelihood depends in a non-smooth way on the model indicator variable. Importantly, different models have latent spaces of different dimensionality. We will see that MCMC then requires special techniques called trans-dimensional MCMC.
Ratio of the marginal likelihood for two models:
\[ B_{12} = \frac{m_1(x)}{m_2(x)} \]
Values of \(B_{12}\) greater than 1.0 favor model #1 over #2; values smaller than 1.0 favor #2 over #1. See the Wikipedia article for some conventions on how to describe the strength of evidence in text: https://en.wikipedia.org/wiki/Bayes_factor.
This is just a reparameterization of the Bayes estimator with an asymmetric 0-1 loss. Note that it is different from a likelihood ratio:
\[ \frac{\sup_{z_1} \ell_1(x|z_1)}{\sup_{z_2} \ell_2(x|z_2)}, \]
which is not used within the Bayesian framework.
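To see the contrast concretely: since model 2's likelihood at \(\beta_1 = \beta_2 = \alpha\) coincides with model 1's, the maximized-likelihood ratio can never favor model 1 in this example, whereas the Bayes factor can. A Python sketch (hypothetical counts, assumes \(x_i > 0\)):

```python
import math

def poi_loglik(x, lam):
    # log Poisson mass: x log(lam) - lam - log(x!)
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def log_likelihood_ratio(x1, x2):
    # sup over model 1 (MLE: shared alpha = mean of the counts) minus
    # sup over model 2 (MLEs: beta_i = x_i); always <= 0 here.
    alpha_hat = (x1 + x2) / 2
    num = poi_loglik(x1, alpha_hat) + poi_loglik(x2, alpha_hat)
    den = poi_loglik(x1, x1) + poi_loglik(x2, x2)
    return num - den

print(log_likelihood_ratio(3, 10))  # negative: favors the richer model
print(log_likelihood_ratio(5, 5))   # zero: the two MLEs coincide
```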
Many approaches exist, each with pros and cons; see for example “19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology”
I will outline some key methods:
The idea is to build an augmented model, which can be written as a graphical model, and from which we can still approximate \(m_i(x)\).
Construction of the auxiliary latent space:
This creates the following auxiliary latent space:
\[ {\mathscr{Z}}' = \{1, 2\} \times {\mathscr{Z}}_1 \times {\mathscr{Z}}_2. \]
Example: dim(\({\mathscr{Z}}_1\)) = 1, dim(\({\mathscr{Z}}_2\)) = 2, for the rabbit hole problem
Construction of the auxiliary joint distribution: suppose the current state is \((\mu, z_1, z_2, x)\). We need to define an auxiliary joint density \(\tilde p(\mu, z_1, z_2, x)\).
The idea is that when \(\mu = 1\), we explain the data \(x\) using \(z_1\), and when \(\mu = 2\), we explain the data \(x\) using \(z_2\).
In notation, if \(\mu = 1\), we set:
\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_1(x | z_1), \]
and if \(\mu = 2\),
\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_2(x | z_2). \]
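To make the construction concrete, here is a small Python sketch (not part of the original notes) of a Gibbs sampler on the saturated space for the rabbit example, with hypothetical counts \(x = (3, 10)\) and Exp(0.1) priors. Writing `a` for model 1's rate and `(b1, b2)` for model 2's rates, the active block is refreshed from its conjugate Gamma full conditional, the inactive block from its prior, and \(\mu\) from \(p(\mu)\) times the likelihood of its block:

```python
import math
import random

def poi_loglik(x, lam):
    # log Poisson mass: x log(lam) - lam - log(x!)
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def saturated_gibbs(x1, x2, rate=0.1, n_iter=50_000, seed=2):
    # Gibbs on (mu, a, b1, b2): Exp(rate) priors are conjugate to the
    # Poisson likelihoods, so active blocks have Gamma full conditionals.
    rng = random.Random(seed)
    mu, count_mu1 = 1, 0
    for _ in range(n_iter):
        if mu == 1:
            a = rng.gammavariate(x1 + x2 + 1, 1.0 / (rate + 2))    # posterior
            b1, b2 = rng.expovariate(rate), rng.expovariate(rate)  # prior
        else:
            a = rng.expovariate(rate)                              # prior
            b1 = rng.gammavariate(x1 + 1, 1.0 / (rate + 1))        # posterior
            b2 = rng.gammavariate(x2 + 1, 1.0 / (rate + 1))        # posterior
        l1 = poi_loglik(x1, a) + poi_loglik(x2, a)
        l2 = poi_loglik(x1, b1) + poi_loglik(x2, b2)
        # p(mu = 1 | rest) under a uniform prior on mu (clamp avoids overflow)
        p1 = 1.0 / (1.0 + math.exp(min(l2 - l1, 700.0)))
        mu = 1 if rng.random() < p1 else 2
        count_mu1 += (mu == 1)
    return count_mu1 / n_iter
```

The returned fraction of sweeps with \(\mu = 1\) estimates \(\tilde p(\mu = 1 | x)\), which under a uniform prior equals \(m_1 / (m_1 + m_2)\).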
Example: third model in Model_selection.html
Recall:
\[m_i(x) = \int p_i(z) \ell_i(x | z) {\text{d}}z\]
Let us say \(\mu = 1\). Then:
\[ \begin{align*} \tilde p(\mu|x) &\propto \tilde p(\mu, x) \\ &= \int_{z_1} \int_{z_2} \tilde p(\mu, z_1, z_2, x) \,{\text{d}}z_2 \,{\text{d}}z_1 \\ &= \int_{z_1} \int_{z_2} p(\mu) p_1(z_1) p_2(z_2) \ell_1(x|z_1) \,{\text{d}}z_2 \,{\text{d}}z_1 \\ &= p(\mu) \left( \int_{z_1} p_1(z_1) \ell_1(x|z_1) \,{\text{d}}z_1 \right) \left( \int_{z_2} p_2(z_2) \,{\text{d}}z_2 \right) \\ &= p(\mu) \times m_1 \times 1 \end{align*} \]
Stepping stone: using the stepping stone method, we obtained \(\log m_1 \approx -11.94\) and \(\log m_2 \approx -8.98\).
Model saturation: since \(\mu\) is Bernoulli, you can approximate \(\tilde p(\mu|x)\) by looking at the posterior mean of the indicator variable in the third model in Model_selection.html.
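Plugging the stepping-stone estimates quoted above into this identity gives the posterior model probabilities under a uniform prior (a quick back-of-the-envelope conversion, not from the original notes):

```python
import math

lm1, lm2 = -11.94, -8.98   # stepping-stone estimates of log m1, log m2
log_B12 = lm1 - lm2        # log Bayes factor, about -2.96
# p(M1 | x) = m1 / (m1 + m2) under a uniform prior on models
p1 = 1.0 / (1.0 + math.exp(lm2 - lm1))
print(log_B12, p1)         # p1 is roughly 0.05: model 2 is favored
```

This should agree, up to Monte Carlo error, with the posterior mean of the indicator variable obtained via model saturation.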
Consider now the final exercise in Model_selection.html.
Pro: can use graphical-model-based MCMC software even when it does not provide marginal likelihood computation methods.
Cons:
Chapter 7 in Model Choice, except for section 7.3.
trials,arms,groupSizes,numbersOfCases
AZ-Oxford (combined),vaccinated,5807,30
AZ-Oxford (combined),control,5829,101
Pfizer-BioNTech,vaccinated,18198,8
Pfizer-BioNTech,control,18325,162
Moderna-NIH,vaccinated,14134,11
Moderna-NIH,control,14073,185
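As a quick sanity check on the data (not part of the notes), the naive plug-in efficacy \(1 - (\text{attack rate, vaccinated}) / (\text{attack rate, control})\) can be computed per trial; it roughly matches the headline figures reported for these vaccines (about 70%, 95%, 94%):

```python
# trial -> ((vaccinated size, cases), (control size, cases)), from the table above
rows = {
    "AZ-Oxford (combined)": ((5807, 30), (5829, 101)),
    "Pfizer-BioNTech":      ((18198, 8), (18325, 162)),
    "Moderna-NIH":          ((14134, 11), (14073, 185)),
}
for trial, ((nv, cv), (nc, cc)) in rows.items():
    efficacy = 1.0 - (cv / nv) / (cc / nc)  # naive point estimate, no uncertainty
    print(f"{trial}: {efficacy:.1%}")
```

The Bayesian model below goes beyond these point estimates by quantifying uncertainty and sharing strength across trials.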
~ Binomial(groupSize, incidence * (1.0 - efficacy))
where efficacy is 0.0 for control, and random for each vaccine.

--model.data data/vaccines/data.csv
model Vaccines {
  param GlobalDataSource data
  param Plate<String> trials
  param Plate<String> arms
  param Plated<IntVar> groupSizes
  random Plated<IntVar> numbersOfCases
  random Plated<RealVar> efficacies
  random Plated<RealVar> incidences
  random RealVar a ?: latentReal, b ?: latentReal, c ?: latentReal, d ?: latentReal
  laws {
    // fill this
  }
}