Alexandre Bouchard-Côté

**Goal** computing

\[Z = \int \gamma(x) {\text{d}}x\]

**Main motivation in Bayesian statistics** comparing models via Bayes factors, i.e. ratios of marginal likelihood, see T8-model-selection-basics.html

**Situation where reversible jump shines** when the set of models to consider is countably infinite

Two main ideas:

- To accommodate an infinite number of models inside a bigger one: an augmented distribution construction (cousin of “model saturation” but distinct)
- To make it easier to “jump between models”: a different perspective on the Metropolis Hastings ratio

We will illustrate that augmented distribution on the “rabbit hole” example we used to introduce “model saturation”

See T8-model-selection-basics.html#(3) for a review of this example

**Recall: model saturation**

- let \({\mathscr{Z}}_1\) for model saturation, we used

\[ \begin{align*} {\mathscr{Z}}' &= \{1, 2\} \times {\mathscr{Z}}_1 \times {\mathscr{Z}}_2 \\ \tilde p(\mu, z_1, z_2, x) &= p(\mu) p_1(z_1) p_2(z_2) \ell_{\mu}(x | z_{\mu}) \end{align*} \]

- problem: with many models, this state space will quickly grow!

**Reversible jump augmentation** exploit the fact that each state space is a product space: for example here \({\mathscr{Z}}_i = {\mathbf{R}}^i\). Then only keep a copy of the largest space considered (here \({\mathscr{Z}}_2 = {\mathbf{R}}^2\)), plus a categorical variable \(\mu\) to keep track of which model we are currently in)

\[ \begin{align*} {\mathscr{Z}}' &= \{1, 2\} \times {\mathscr{Z}}_2 \\ \tilde p(\mu, z, x) &= p(\mu) (p_1(z_1) a(z_2) \ell_1(x | z_1))^{{{\bf 1}}[\mu = 1]} (p_2(z) \ell_2(x | z))^{{{\bf 1}}[\mu = 2]} \end{align*} \]

Here \(a\) is any auxiliary distribution you want, often taken to be the same as one of the priors (only formal requirement is that you should be able to sample iid from it)

Note: we recover the two models as follows:

\[ \begin{align*} \tilde p(z_1, x | \mu = 1) &= p_1(z_1) \ell_1(x | z_1) \\ \tilde p(z, x | \mu = 2) &= p_2(z_1, z_2) \ell_2(x | z) \end{align*} \]

Generalization to any finite set of models is simple. What about an infinite set of models?

Suppose now we have an infinite set of models with state spaces, \({\mathscr{Z}}_1, {\mathscr{Z}}_2, {\mathscr{Z}}_3, \dots\), where again each model’s state space is a product space, for example \({\mathscr{Z}}_i = {\mathbf{R}}^i\).

Do you see a problem?

Recall: in the construction we just did, “keep a copy of the largest space considered […] plus a categorical variable \(\mu\) to keep track of which model we are currently in”

- To handle an infinite set of models, we need an MCMC algorithm defined over \({\mathbf{R}}^\infty\).
- Let \(z = (z_1, z_2, \dots)\) denote the infinite list that we cannot explicitly store in memory

- The theory allows us to do this
- What about implementation?

**Key idea** “lazy” representation of infinite dimensional objects. We only keep in memory the prefix of \(z\) that we need at current iteration. Use iid draws from \(a\) to fill in missing parts when we need more.

- Suppose at the current iteration your MCMC chain is at model \(\mu = 1\)
- We have \[\tilde p(z, x| \mu = 1) = p_1(z_1) \ell_1(x | z_1) \prod_{i=2}^\infty a(z_i)\]
- Right now we only keep \(z_1\) in memory
- We make a proposal on \(\mu\), say we propose \(\mu = 2\)
- Now we need \(z_2\)
- Just pretend that we did a draw from \(a\) just before proposing: i.e. \(z_2 \sim a(\cdot)\)
- in the code, we do it after the proposal, and that’s equivalent algorithmically to doing it just before proposing,
- but to analyze the chain, we pretend we did it just before (for all \(i > 1\) in the infinite product) since this allows us to invoke a deterministic alternation argument

**Note** the same idea of “lazy” representation of infinite dimensional objects is also key to many inference algorithms used in the Bayesian non-parametric literature.

- Just proposing a change in the model indicator \(\mu\) without changing the other variables (as we did in the last slide) can lead to very low acceptance rates
- This could be because the “interpretation” of say \(z_1\) could be very different in model 1 versus model 2
- Or it could be that \(a(\cdot)\) is far from \(\tilde p(z_2| \mu = 2)\)

- Idea: propose a change to both \(z\) and \(\mu\) at the same time
- Observation: sometimes it is more natural to use “deterministic proposals” (true in general, not just in context of reversible jumb)
- \(z' \sim q(\cdot | z)\), where \(q(\cdot | z) = \delta_{f(z)}(\cdot)\)
- i.e. more plainly: \(z' = f(z)\)

- From the first idea, we have “padded” a variable number of auxiliary variables, using \(a(\cdot)\) in order to be able to build diffeomorphic mappings \(\Psi\) (more specifically, mappings with non-vanishing Jacobians). The mappings \(\Psi\) can be thought of as proposals.
- We may need more than one \(\Psi_j\), selected at random according to some probabilities \(\rho_{i\to j}\).

**Dimensionality matching:** a necessary conditions for the mapping to be diffeomorphic is that the input dimensionality of \(\Psi\) should match the output dimensionality of \(\Psi\).

**Consequence:** let us say that we want to “jump” from a model with \(m_1\) dimensions into one with \(m_2\) dimensions. What constraints do we have on the number \(n_1\) of auxiliary variables we add to the first model, and the number \(n_2\) we add to the second?

**Notation:**

- \(p(i)\) prior on model \(i\)
- \(\pi_i\) posterior given model \(i\)
- \(i,i'\) old and proposed model indices
- \(x, x'\) old and proposed model parameters
- \(u_i\): auxiliary variables before the move, input into \(\Psi_j\), with density \(g_i\)
- \(u_{i'}\): auxiliary variables after the move, output of \(\Psi_j\), with density \(g_{i'}\)

**Ratio for RJMCMC:**

\[ \frac{p(i')\pi_{i'}(x')}{p(i)\pi_i(x)} \frac{\rho_{i'\to i}}{\rho_{i\to i'}} \frac{g_{i'}(u_{i'})}{g_{i}(u_{i})} \left| J(x', u_2) \right| \]

**Example:** textbook, page 365.