Alexandre Bouchard-Côté

Unidentifiably/weakly identifiable models create posterior distributions that are hard to sample from.

**Example** https://www.stat.ubc.ca/~bouchard/courses/stat520-sp2020-21/T9-consistency.html#(11)

- Demonstration that slice sampling alone struggle in this problem: PT_on_an_unidentifiable_problem.html

- Note: many other standard MCMC schemes also struggle with this kind of problems (hit and run, Metropolis-Hastings, Langevin schemes, Hamiltonian, etc)

**Solution: parallel tempering**, the MCMC jackhammer

Combination of two ideas: “annealing” and “parallel chains”

- First idea: annealing
- let us start by assuming for simplicity the prior is uniform on a bounded region
- suppose a density has height \(h\) at a point
- consider \(h^{0.0001}\), note that this \(\approx 1\), for both \(h=10\) and \(h=0.1\)
- taking a small power of a positive number brings it closer to \(1\)
- let’s do that with every point in the landscape we seek to explore!

\[\pi_{\beta}(x) = (\pi(x))^{\beta}\]

where \(\beta\) controls an interpolation between:

- the uniform distribution (\(\beta = 0\), easy to sample from)
- the posterior distribution (\(\beta = 1\), hard)

If the prior is defined over an unbounded set, generalize into: \[\pi_{\beta}(x) = \text{prior}(x)(\text{likelihood}(x))^{\beta}\]

Let us call \(\beta\) the **annealing parameter**

Now we have several different “landscapes”, from smooth to rough. We will explore all of them simultaneously!

- Let’s pick a sequence of annealing parameters, say \((\beta_0, \dots, \beta_N) = (0, 1/N, 2/N, \dots, 1)\)
- Second idea: parallel tempering
- Run several interacting MCMC chains in parallel (each say based on slice sampling) for each landscape \(\pi_i(x) = \pi_{\beta_i}\) including the easy-to-explore landscape (\(i=0\) where \(\beta = 0\)) and the landscape of interest (\(i=N\) where \(\beta = 1\))
- Idea formalized using an
**augmented distribution**\[\tilde \pi(x_0, x_1, \dots, x_N) = \pi_0(x_0) \pi_1(x_1) \dots \pi_N(x_N)\] - This ensemble of chains provides us with samples from the high dimensional distribution \(\tilde \pi(x_0, x_1, \dots, x_N)\)

- At the end, only keep the states in \(x_N\), i.e. those corresponding to \(\beta = 1\) (the posterior)
- But use the other ones to propose swaps between pairs of chains to allow “jumps” across the state space
- Swap are accepted or rejected according to the Metropolis rule from earlier today (coming up: can you derive the acceptance probability?)

- Visualization of the algorithm: pt-algo.pdf

Suppose you just proposed a swap between chain \(i\) and \(i+1\)

- One
- \((\pi_i(x_{i}) \pi_{i+1}(x_{i+1})) / (\pi_i(x_{i+1}) \pi_{i+1}(x_{i}))\)
- \((\pi_i(x_{i+1}) \pi_{i+1}(x_{i})) / (\pi_i(x_{i}) \pi_{i+1}(x_{i+1}))\)
- \(\pi_i(x_{i+1}) / \pi_{i+1}(x_{i})\)
- \(\pi_{i+1}(x_{i}) / \pi_i(x_{i+1})\)

Let \(x = (x_0, x_1, \dots, x_N)\) denote the current state, and the proposed state is \(x' = (x_0, x_1, \dots, x_{i-1}, \color{blue}{x_{i+1}}, \color{red}{x_i}, x_{i+2}, \dots, x_N)\)

Note: the proposal is symmetric, hence the ratio ignores the proposal and only involves the ratio of augmented target distributions:

\[ \begin{align*} \frac{\tilde \pi(x')}{\tilde \pi(x)} &= \frac{\pi_0(x_0) \pi_1(x_1) \dots \pi_{i-1}(x_{i-1}) \color{blue}{\pi_i(x_{i+1})} \color{red}{\pi_{i+1}(x_i)} \pi_{i+2}(x_{i+2}) \dots \pi_N(x_N)}{\pi_0(x_0) \pi_1(x_1) \dots \pi_{i-1}(x_{i-1}) \color{blue}{\pi_i(x_{i})} \color{red}{\pi_{i+1}(x_{i+1})} \pi_{i+2}(x_{i+2}) \dots \pi_N(x_N)} \\ &= \frac{\color{blue}{\pi_i(x_{i+1})} \color{red}{\pi_{i+1}(x_i)}}{\color{blue}{\pi_i(x_{i})} \color{red}{\pi_{i+1}(x_{i+1})}}. \end{align*} \]

We now have all the ingredients to demonstrate how the “MCMC methodology game” can be applied to get a law of large numbers for the parallel tempering algorithm.

- The swap move is a MH move and hence is \(\pi\)-invariant
- We assume the “local exploration kernels” are \(\pi\)-invariant and irreducible
- Hence using either mixture of alternation of the “local exploration steps” and “communication steps” yields a \(\pi\)-invariant and irreducible sampler
- Hence, we have a law of large numebers

Interesting twist coming up soon: deciding which of mixture or alternation to use can have a huge effect in the context of parallel tempering!

- You may need many parallel chains to be able to swap frequently
*Reversible*parallel tempering gets really slow in this context- Ongoing work in my research group:
*non-reversible*parallel tempering avoid this issue

**Parallel Tempering: Reversible (top) versus non-reversible (bottom)**