Parallel tempering

Alexandre Bouchard-Côté


Unidentifiable or weakly identifiable models create posterior distributions that are hard to sample from.



Solution: parallel tempering, the MCMC jackhammer

Parallel tempering: annealing


Combination of two ideas: “annealing” and “parallel chains”

\[\pi_{\beta}(x) = (\pi(x))^{\beta}\]

where \(\beta \in [0, 1]\) controls an interpolation between a flat (uniform) distribution at \(\beta = 0\) and the target \(\pi\) at \(\beta = 1\)

If the prior is defined over an unbounded set, generalize to: \[\pi_{\beta}(x) = \text{prior}(x)(\text{likelihood}(x))^{\beta}\]

Let us call \(\beta\) the annealing parameter

Now we have several different “landscapes”, from smooth to rough. We will explore all of them simultaneously!
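A minimal numerical sketch of the tempered family (the bimodal likelihood and all function names here are illustrative assumptions, not from the lecture): at \(\beta = 0\) the landscape is just the smooth prior, and raising \(\beta\) toward 1 gradually reintroduces the rough likelihood.

```python
import numpy as np

def log_prior(x):
    # smooth, unimodal prior: N(0, 10^2), up to a constant
    return -0.5 * (x / 10.0) ** 2

def log_likelihood(x):
    # hypothetical likelihood with two well-separated modes at +/- 4
    # -> a rough landscape at beta = 1
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)

def log_pi_beta(x, beta):
    # log of pi_beta(x) = prior(x) * likelihood(x)^beta, up to normalization
    return log_prior(x) + beta * log_likelihood(x)

# a ladder of annealing parameters, from prior (beta = 0) to posterior (beta = 1)
betas = [0.0, 0.25, 0.5, 1.0]
xs = np.linspace(-8.0, 8.0, 5)
for beta in betas:
    print(beta, np.round(log_pi_beta(xs, beta), 2))
```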

Parallel tempering: swaps

Poll: what is the swap acceptance ratio?

Suppose you just proposed a swap between chain \(i\) and \(i+1\)

  1. One
  2. \((\pi_i(x_{i}) \pi_{i+1}(x_{i+1})) / (\pi_i(x_{i+1}) \pi_{i+1}(x_{i}))\)
  3. \((\pi_i(x_{i+1}) \pi_{i+1}(x_{i})) / (\pi_i(x_{i}) \pi_{i+1}(x_{i+1}))\)
  4. \(\pi_i(x_{i+1}) / \pi_{i+1}(x_{i})\)
  5. \(\pi_{i+1}(x_{i}) / \pi_i(x_{i+1})\)

Swap acceptance ratio

Let \(x = (x_0, x_1, \dots, x_N)\) denote the current state; the proposed state swaps components \(i\) and \(i+1\): \(x' = (x_0, x_1, \dots, x_{i-1}, \color{blue}{x_{i+1}}, \color{red}{x_i}, x_{i+2}, \dots, x_N)\)

Note: the proposal is symmetric, so the proposal densities cancel in the MH ratio, which therefore involves only the ratio of augmented target distributions:

\[ \begin{align*} \frac{\tilde \pi(x')}{\tilde \pi(x)} &= \frac{\pi_0(x_0) \pi_1(x_1) \dots \pi_{i-1}(x_{i-1}) \color{blue}{\pi_i(x_{i+1})} \color{red}{\pi_{i+1}(x_i)} \pi_{i+2}(x_{i+2}) \dots \pi_N(x_N)}{\pi_0(x_0) \pi_1(x_1) \dots \pi_{i-1}(x_{i-1}) \color{blue}{\pi_i(x_{i})} \color{red}{\pi_{i+1}(x_{i+1})} \pi_{i+2}(x_{i+2}) \dots \pi_N(x_N)} \\ &= \frac{\color{blue}{\pi_i(x_{i+1})} \color{red}{\pi_{i+1}(x_i)}}{\color{blue}{\pi_i(x_{i})} \color{red}{\pi_{i+1}(x_{i+1})}}. \end{align*} \]
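The ratio above can be computed directly in log space (a sketch; the function name and arguments are illustrative):

```python
import math

def swap_accept_prob(log_pi_i, log_pi_ip1, x_i, x_ip1):
    """MH acceptance probability for a proposed swap between chains i and i+1.

    Implements min(1, pi_i(x_{i+1}) pi_{i+1}(x_i) / (pi_i(x_i) pi_{i+1}(x_{i+1}))),
    computed in log space for numerical stability.
    """
    log_ratio = (log_pi_i(x_ip1) + log_pi_ip1(x_i)
                 - log_pi_i(x_i) - log_pi_ip1(x_ip1))
    return min(1.0, math.exp(log_ratio))
```

For example, swapping two identical states is always accepted, since the ratio is exactly one.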

A law of large numbers

We now have all the ingredients to demonstrate how the “MCMC methodology game” can be applied to get a law of large numbers for the parallel tempering algorithm.

  1. The swap move is an MH move and hence is \(\pi\)-invariant
  2. We assume the “local exploration kernels” are \(\pi\)-invariant and irreducible
  3. Hence using either a mixture or an alternation of the “local exploration steps” and “communication steps” yields a \(\pi\)-invariant and irreducible sampler
  4. Hence, we have a law of large numbers

Interesting twist coming up soon: deciding which of mixture or alternation to use can have a huge effect in the context of parallel tempering!
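The alternation scheme above can be sketched end to end. This is a minimal illustrative implementation, not from the lecture: one random-walk MH step per tempered chain, followed by one round of adjacent stochastic swaps; samples are collected from the \(\beta = 1\) chain.

```python
import math
import random

def parallel_tempering(log_lik, log_prior, betas, x0, n_iter,
                       step=1.0, seed=1):
    """Sketch of parallel tempering on a 1-d state space.

    `betas` is assumed sorted increasing, ending at 1.0 (the target chain).
    """
    rng = random.Random(seed)

    def log_pi(beta, x):
        # tempered target: prior(x) * likelihood(x)^beta, in log space
        return log_prior(x) + beta * log_lik(x)

    xs = [x0] * len(betas)
    samples = []  # draws from the beta = 1 (target) chain
    for _ in range(n_iter):
        # local exploration: one random-walk MH move within each chain
        for i, beta in enumerate(betas):
            prop = xs[i] + rng.gauss(0.0, step)
            if math.log(rng.random()) < log_pi(beta, prop) - log_pi(beta, xs[i]):
                xs[i] = prop
        # communication: propose swaps between adjacent chains
        for i in range(len(betas) - 1):
            log_ratio = (log_pi(betas[i], xs[i + 1]) + log_pi(betas[i + 1], xs[i])
                         - log_pi(betas[i], xs[i]) - log_pi(betas[i + 1], xs[i + 1]))
            if math.log(rng.random()) < log_ratio:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
        samples.append(xs[-1])
    return samples
```

Both ingredients are \(\pi\)-invariant, so their alternation targets the augmented distribution \(\tilde\pi\), and the marginal of the last coordinate is the target \(\pi\).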

Non-reversible parallel tempering

Parallel Tempering: Reversible (top) versus non-reversible (bottom)
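One common non-reversible communication scheme (often called deterministic even-odd, DEO; this sketch and its function names are my own illustration, not taken from the slides) replaces independent randomized swap attempts with deterministic alternation: even-indexed pairs \((0,1), (2,3), \dots\) are proposed on even rounds, odd-indexed pairs \((1,2), (3,4), \dots\) on odd rounds. Each individual swap is still accepted with the usual MH ratio.

```python
def deo_swap_round(xs, iteration, accept):
    """One communication round of an even-odd (non-reversible) swap schedule.

    `xs` is the list of per-chain states; `accept(i, x_i, x_ip1)` returns True
    if the MH swap between chains i and i+1 is accepted. Even iterations
    propose pairs (0,1), (2,3), ...; odd iterations propose (1,2), (3,4), ...
    """
    parity = iteration % 2
    for i in range(parity, len(xs) - 1, 2):
        if accept(i, xs[i], xs[i + 1]):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs
```

With this schedule a state that keeps getting accepted travels steadily in one direction along the ladder, rather than diffusing back and forth as in the reversible scheme.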