# Monte Carlo methods

• Recall, from earlier, that Monte Carlo methods refer to the idea of approximating an integral by an average of a function $$f$$ evaluated at $$K$$ random variables, where $$K$$ is the number of iterations of our approximation algorithm.

$\frac{1}{K} \sum_{k=1}^K f(X_k) \approx \int f(x) \pi(x) {\text{d}}x$

• Today we introduce Markov chain Monte Carlo (MCMC), which is the most flexible way to do Monte Carlo.
• With MCMC, $$X_1, X_2, \dots$$ is the list of states visited by a Markov chain
• With MCMC, the above approximation is justified using a law of large numbers for Markov chains
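To make the Monte Carlo average concrete, here is a minimal sketch in Python (the notes use R elsewhere; all names below are my own): we approximate $$\int x^2 \pi(x) \text{d}x = 1$$ for $$\pi$$ a standard normal by averaging over independent draws.

```python
import random

random.seed(1)

def monte_carlo(f, sampler, K):
    # Average f over K independent draws from the sampler
    return sum(f(sampler()) for _ in range(K)) / K

# Example: E[X^2] = 1 for X ~ N(0, 1)
estimate = monte_carlo(lambda x: x * x, lambda: random.gauss(0.0, 1.0), 100_000)
```

MCMC will replace the independent draws with the (dependent) states of a Markov chain, keeping the same averaging recipe.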

# Most common setup in Bayesian statistics

• Let $$x$$ denote the (vector of) latent (unobserved) variables
• Let $$y$$ denote the observations
• Let $$\gamma(x, y)$$ denote the joint distribution
• We will omit dependence on $$y$$ in the notation: $$\gamma(x) = \gamma(x, y)$$
• We assume that for a given hypothesized $$x$$, we can compute the numerical value $$\gamma(x)$$
• Let $$\pi(x | y)$$ denote the posterior
• Again, omit $$y$$ from the notation: $$\pi(x) = \pi(x | y)$$
• Hence: there is some $$C$$ such that $$\pi(x) = \gamma(x) / C$$
• Here $$C$$ does not depend on $$x$$
• It happens to be the marginal likelihood but we won’t use this fact today

# Extracting the essential ingredients of the “common setup”

• $$\pi(x)$$ is an “intractable distribution” called the target
• we seek to compute $$\int f(x) \pi(x) {\text{d}}x$$, where $$f$$ is a test function
• $$\pi(x) = \gamma(x) / C$$, where
• $$\gamma(x)$$ is the unnormalized density which can be “computed pointwise”
• $$C$$ is an unknown constant, the normalization constant, computing it is assumed intractable
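To make “computed pointwise” concrete, here is a hedged Python sketch for a coin-flipping (beta-binomial) example: a uniform prior on the heads probability $$x$$ and $$k$$ observed heads out of $$n$$ flips (the numbers are my own illustration, not from the notes). We can evaluate $$\gamma$$ anywhere, while treating $$C$$ as unknown.

```python
# Unnormalized posterior gamma(x) = prior(x) * likelihood(y | x)
# for x = probability of heads, with y = 7 heads out of n = 10 flips
n, k = 10, 7

def gamma(x):
    if not 0.0 <= x <= 1.0:
        return 0.0  # prior density is 0 outside [0, 1]
    # uniform prior times binomial likelihood (constants absorbed into C)
    return x**k * (1.0 - x)**(n - k)

value = gamma(0.5)  # gamma can be evaluated pointwise...
# ...but the normalization constant C = integral of gamma is treated as unknown
```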

# Outline

• Today we will go over one of the most powerful ways to simulate the random variables needed for a Monte Carlo average: the Metropolis-Hastings algorithm
• Side note on pedagogic approach: I will use a new route to teach this material
• typically Metropolis-Hastings (MH) is introduced by “guessing” a certain ratio called the MH ratio, which is then shown to have a nice property (“invariance”); then one special case of this ratio is used to introduce a method called “slice sampling”
• here I will do the reverse: I will introduce “slice sampling” which is more intuitive and justify it from first principles, then derive MH as a consequence of slice sampling

# Prerequisite: a bit of Markov chain theory

Question: how to design a sampler for a uniform distribution on a set $$A$$?

Example: $$A = \{0, 1, 2\}$$

Idea: use a random walk so that it scales to complicated/high dimensional $$A$$

### Poll: what random walk has a uniform stationary distribution on $$A$$?

1. Move to a neighbor at random
2. Move to a neighbor at random, except at the boundary, where you stop moving
3. Move to a neighbor at random, including yourself (self-transition)
4. None of these walks would work

# Markov chain 1 (incorrect)

Move to one neighbor at random?

How to check if this is correct:

1. Naive method: simulate a long chain, compute a histogram from the list of states visited
2. Exercise: explain why the code below is equivalent to the “Naive method”
compute_stationary_distribution <- function(M) {
  # The stationary distribution is the left eigenvector of M
  # associated with the eigenvalue 1, normalized to sum to one
  eig <- eigen(t(M))
  i <- which.min(abs(eig$values - 1))
  statio <- Re(eig$vectors[, i])
  statio <- statio / sum(statio)
  return(statio)
}

M = matrix(c(
0,   1, 0,
1/2, 0, 1/2,
0,   1, 0
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))

# Markov chain 2 (incorrect)

Move to a neighbor at random, except at the boundary, where you stop moving.

No: you would get a random stationary distribution (either a Dirac delta at state 0 or a Dirac delta at state 2)

# Markov chain 3 (incorrect)

Move to a neighbor at random, including a self-transition

M = matrix(c(
1/2,   1/2, 0,
1/4, 1/2, 1/4,
0,   1/2, 1/2
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))

# Markov chain 4: propose + accept reject

• Say we are at position $$x_t \in A$$ at iteration $$t$$
• Propose to move to the left or to the right with equal probability (an example of a proposal distribution $$q(x^* | x_t)$$).
• We get a proposal $$x^*$$
• However this proposal might be outside of $$A$$!
• Basic accept-reject step:
• If $$x^* \in A$$, set $$x_{t+1} = x^*$$ (accept)
• Else, set $$x_{t+1} = x_t$$ (reject)
M = matrix(c(
1/2,   1/2, 0,
1/2,   0,   1/2,
0,     1/2, 1/2
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))
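Equivalently to the eigenvector check, we can simulate walk 4 directly and look at the empirical occupancies, as in the “naive method” above (a Python sketch of my own; the notes’ code is in R):

```python
import random

random.seed(0)

counts = [0, 0, 0]
x = 0
n_iters = 300_000
for _ in range(n_iters):
    proposal = x + random.choice([-1, 1])  # propose left or right
    if 0 <= proposal <= 2:                 # accept if still in A = {0, 1, 2}
        x = proposal
    # else: reject, stay at x (self-transition)
    counts[x] += 1

occupancies = [c / n_iters for c in counts]  # should approach (1/3, 1/3, 1/3)
```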

# Understanding why idea/walk 1 fails while idea/walk 4 works

• Think of a very large population of people each with a laptop running MCMC
• There are 3 rooms corresponding to the 3 states $$A = \{0, 1, 2\}$$
• Depending on the state on their laptop, each person goes in the corresponding room

# Occupancy vector and transition matrix

• Room occupancies can be organized into a vector. Normalize by number of people and get a distribution.
• If you start with occupancies $$v$$ and let everyone run their MCMC for one step and do their room change, you get a new occupancy $$v'$$
• Exercise: convince yourself $$v' = v M$$ where row $$i$$ of $$M$$ is the transition probabilities for an MCMC at state $$i$$. I.e. $M_{i,j} = {\mathbb{P}}(X_{t+1} = j | X_t = i)$
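A quick numeric check of $$v' = vM$$ (a Python sketch of my own), using the walk-4 transition matrix from above:

```python
# Transition matrix of walk 4 on A = {0, 1, 2}: rows are "from", columns are "to"
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]

def step(v, M):
    # One step of the occupancy evolution: v'_j = sum_i v_i * M[i][j]
    return [sum(v[i] * M[i][j] for i in range(3)) for j in range(3)]

v1 = step([1.0, 0.0, 0.0], M)        # everyone starts in room 0: (1/2, 1/2, 0)
v_unif = step([1/3, 1/3, 1/3], M)    # the uniform occupancy is preserved exactly
```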

# Stationarity

• The distribution of interest is $$\pi = (1/3, 1/3, 1/3)$$.
• Let us try to build $$M$$ such that if we start with an occupancy given by the distribution of interest, $$v = \pi$$, then after one step that occupancy is exactly preserved: $\pi M = \pi$ (this is called global balance)
• For ‘idea/walk 4’ this is easy to see: between any pair of rooms $$i$$ and $$j$$, the number of people going from $$i$$ to $$j$$ is the same as the number going from $$j$$ to $$i$$ (this is an instance of detailed balance/reversibility, $$M_{i,j} \pi_i = M_{j,i} \pi_j$$)

• Compare with ‘idea/walk 1’ which does not satisfy global balance:

• Exercises:
• show that detailed balance/reversibility implies global balance
• show that the other way around is not true (hint: consider for example a “cyclic tour”)
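Detailed balance for a given matrix can be checked mechanically; here is a hedged Python sketch (this only verifies the two concrete walks numerically; the exercises above, proving the general implications, are left to you):

```python
pi = [1/3, 1/3, 1/3]

def detailed_balance(M, pi):
    # Check pi_i * M[i][j] == pi_j * M[j][i] for all pairs (i, j)
    n = len(pi)
    return all(abs(pi[i] * M[i][j] - pi[j] * M[j][i]) < 1e-12
               for i in range(n) for j in range(n))

M4 = [[0.5, 0.5, 0.0],   # walk 4 (propose + accept/reject)
      [0.5, 0.0, 0.5],
      [0.0, 0.5, 0.5]]

M1 = [[0.0, 1.0, 0.0],   # walk 1 (move to a neighbor at random)
      [0.5, 0.0, 0.5],
      [0.0, 1.0, 0.0]]

ok4 = detailed_balance(M4, pi)  # walk 4 is reversible w.r.t. uniform
ok1 = detailed_balance(M1, pi)  # walk 1 is not: pi_0 M[0][1] = 1/3 != 1/6 = pi_1 M[1][0]
```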

# Quick overview of (finite) Markov chain theory

• Law of Large Numbers: if we let $$X_n$$ denote which of the three states you visit at time $$n$$, for random walk 4, then $\frac{1}{n} \sum_{i=1}^n f(X_i) \to \frac{f(0) + f(1) + f(2)}{3} \;\;\text{(a.s.)}$

• Optional exercise: prove the LLN for Markov chain in this simplified setup.

• More generally: if a random walk on a discrete space can reach all states from any state (irreducibility), and satisfies global balance, then its sample average satisfies a law of large numbers.
• Do not need aperiodicity for LLN (if you don’t know what aperiodicity is, great! you won’t need it)
• Aperiodicity is a distraction from the MCMC point of view.
• In some sources you will find it discussed at great length, sometimes because the author has a fundamental misunderstanding of how MCMC works

# Generalization

• With a simple generalization of this argument, we now have an MCMC algorithm for any uniform distribution on a “connected” set $$A$$, which could live in any space (continuous, discrete, mixed, etc.)
• Condition that the proposal needs to satisfy in this first generalization: symmetric uniform proposal, $$q(x|x') = q(x'|x)$$ for all $$x, x' \in A$$.
• Exercise:
• Show the Markov chain induced by the above proposal is, for all $$x\in A$$ (edit: assume also $$q(x|x) = 0$$) $K(x'|x) = {{\bf 1}}[x' \neq x] {{\bf 1}}[x' \in A] q(x'|x) + {{\bf 1}}[x' = x] \sum_{x''} {{\bf 1}}[x'' \notin A] q(x''|x)$
• Use this to prove that $$K$$ satisfies detailed balance and hence global balance.
• Only looking at uniform distributions seems quite restrictive!
• But using a clever trick we will show this is actually all we need…
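A hedged Python sketch of this generalization (set and window size are my own illustration): a random walk targeting the uniform distribution on a continuous set $$A$$, here the unit disc, using a symmetric uniform proposal and rejecting any proposal that leaves $$A$$.

```python
import random

random.seed(2)

def in_A(x, y):
    return x * x + y * y <= 1.0  # A = unit disc

x, y = 0.0, 0.0
xs = []
for _ in range(200_000):
    # Symmetric proposal: uniform in a square window around the current point
    px = x + random.uniform(-0.5, 0.5)
    py = y + random.uniform(-0.5, 0.5)
    if in_A(px, py):             # accept only if still inside A
        x, y = px, py
    xs.append(x)                 # on reject, stay and record the old state

mean_x = sum(xs) / len(xs)       # the uniform distribution on the disc has mean 0
```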

# Slice sampling

• Recall the setup we are interested in: posterior density of the form: $\pi(x) = p(x|y) = \frac{\text{prior}(x)\times\text{likelihood}(y|x)}{p(y)} = \frac{\gamma(x)}{C}$
• Idea: it is enough to do a uniform random walk under the graph of the unnormalized density $$\gamma$$

• Notation:
• The walk is denoted $$(x_1, h_1), (x_2, h_2), (x_3, h_3), \dots, (x_n, h_n)$$
• $$x_i$$ is the state of interest (unknown parameter)
• $$h_i$$ is the height of the walk under the curve (just to help us sample, “auxiliary variable”)
• Recall: how to use the random walk to do Bayesian inference:
• How to compute sample mean: $$\frac{1}{n} \sum x_i$$
• How to compute credible interval?

# Slice sampling: some details

• Propose a change for one variable at a time
• Look at current point $$x_i$$,
• compute a maximum height $$= \gamma(x_i)$$ (length of black bold line)
• sample height $$h$$ uniformly in $$[0, \gamma(x_i)]$$ (i.e. sample a point in the black bold line)
• Propose a move to $$x^\star$$ sampled uniformly in a window (green) around the current position, say $$x^\star$$ uniform in $$[x_i - 1, x_i + 1]$$
• These two steps define a proposed point $$(x^\star, h)$$. If this proposed point is still under the curve of $$\gamma$$ (example shown as green circle), accept; otherwise (example shown as red circle), reject and stay where you were
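The steps above can be sketched in a few lines of Python (a hedged illustration of my own: the target is an unnormalized beta-binomial posterior $$\gamma(x) = x^7 (1-x)^3$$, i.e. a Beta(8, 4) posterior whose mean is $$2/3$$, and the window is $$[x_i - 1, x_i + 1]$$ as in the notes):

```python
import random

random.seed(3)

def gamma(x):
    # Unnormalized target: x^7 (1 - x)^3 on [0, 1], zero elsewhere
    return x**7 * (1.0 - x)**3 if 0.0 <= x <= 1.0 else 0.0

x = 0.5
samples = []
for _ in range(100_000):
    h = random.uniform(0.0, gamma(x))          # height under the curve at x
    x_star = random.uniform(x - 1.0, x + 1.0)  # window of size 1 on each side
    if h < gamma(x_star):                      # proposed point still under the curve?
        x = x_star                             # accept
    # else: reject, keep the current x
    samples.append(x)

mean = sum(samples) / len(samples)  # should approach the Beta(8, 4) mean, 2/3
```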

For more on slice sampling (in particular, how to avoid sensitivity to the choice of window size): see the original slice sampling paper by R. Neal

# Relation with Metropolis-Hastings (MH)

• The slice sampler alternates between these two steps:
• Sample a new $$x$$ position, $$x^*$$
• Sample a height $$h$$ uniformly from 0 to the max height
• Metropolis-Hastings: combine the above two moves and replace the uniform height by a coin flip
• How? Let’s consider two cases

Case 1: proposing to go “uphill”, $$\gamma(x^*) \ge \gamma(x)$$. In this case, we always accept! See the picture:

Case 2: proposing to go “downhill”, $$\gamma(x^*) < \gamma(x)$$. In this case, we may accept or reject… See the picture:

### Poll: what is the probability that we will accept in case 2?

1. $$\gamma(x_i)/\gamma(x^*)$$
2. $$\gamma(x^*)/\gamma(x_i)$$
3. $$\gamma(x^*) + \gamma(x_i)$$
4. $$\gamma(x^*) \cdot \gamma(x_i)$$
5. None of the above

# Acceptance probability

We can compute the probability that we accept! It’s the height of the green stick in bold divided by the height of the black stick in bold:

$\frac{\gamma(x^*)}{\gamma(x_i)}$

This quantity is called the Metropolis ratio

# A special case of Metropolis algorithm

We obtain the following algorithm. Suppose the current state is $$x_i$$:

1. Propose a new position, $$x^*$$, uniformly in a ball around $$x_i$$
2. Compute the Metropolis ratio $r = \frac{\gamma(x^*)}{\gamma(x_i)}$
3. Simulate an acceptance variable $a \sim \text{Bern}(\min\{1, r\})$
• If $$a = 1$$: set $$x_{i+1} = x^*$$
• Else, set $$x_{i+1} = x_i$$
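The three steps above, as a hedged Python sketch (my own illustration: the target is again an unnormalized beta-binomial posterior $$\gamma(x) = x^7 (1-x)^3$$, i.e. Beta(8, 4) with mean $$2/3$$, and the “ball” is the interval $$[x_i - 1, x_i + 1]$$):

```python
import random

random.seed(4)

def gamma(x):
    # Unnormalized target: x^7 (1 - x)^3 on [0, 1], zero elsewhere
    return x**7 * (1.0 - x)**3 if 0.0 <= x <= 1.0 else 0.0

x = 0.5
samples = []
for _ in range(100_000):
    x_star = random.uniform(x - 1.0, x + 1.0)  # 1. propose uniformly in a ball around x
    r = gamma(x_star) / gamma(x)               # 2. Metropolis ratio
    if random.random() < min(1.0, r):          # 3. accept with probability min(1, r)
        x = x_star
    samples.append(x)

mean = sum(samples) / len(samples)  # should approach the Beta(8, 4) mean, 2/3
```

Note that $$\gamma(x) > 0$$ is maintained along the chain: a proposal with $$\gamma(x^*) = 0$$ gives $$r = 0$$ and is always rejected.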

# General Metropolis algorithm

More general proposals:

• Instead of using a uniform ball, the same algorithm works (i.e. still admits a law of large numbers) if we generalize the first step into:
1. Propose a new position $$x^*$$, from a symmetric proposal density $$q(x^*|x_i)$$
• Here symmetric means $$q(x'|x) = q(x|x')$$; examples:
• Uniform ball
• Normal distribution centered around the current point

Combinations of proposals:

• Suppose you have two proposals $$q_1$$ and $$q_2$$
• The following algorithm also works (“alternation of Metropolis kernels”):
1. Do one iteration of the Metropolis algorithm with $$q_1$$
2. Do one iteration of the Metropolis algorithm with $$q_2$$
3. Go to 1.

# Another generalization: Metropolis-Hastings

• Suppose you want to use a proposal $$q(x'|x)$$ which is not symmetric.
• Then you need to change the Metropolis ratio into the Metropolis-Hastings ratio: $\frac{\gamma(x^*)}{\gamma(x_i)} \frac{q(x_i|x^*)}{q(x^*|x_i)}$

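A hedged Python sketch of MH with an asymmetric proposal (all choices here are my own illustration): an independence proposal with density $$q(x^*|x) = 2x^*$$ on $$[0,1]$$ (a Beta(2, 1), which ignores the current state and hence is not symmetric), targeting the unnormalized beta-binomial posterior $$\gamma(x) = x^7(1-x)^3$$, i.e. Beta(8, 4) with mean $$2/3$$.

```python
import math
import random

random.seed(5)

def gamma(x):
    # Unnormalized target: x^7 (1 - x)^3 on [0, 1], zero elsewhere
    return x**7 * (1.0 - x)**3 if 0.0 <= x <= 1.0 else 0.0

def q_sample():
    # Beta(2, 1) via inverse CDF: F(x) = x^2, so x = sqrt(u)
    return math.sqrt(random.random())

def q_density(x_star, x):
    # Density of the proposal x_star given current state x (here it ignores x)
    return 2.0 * x_star

x = 0.5
samples = []
for _ in range(100_000):
    x_star = q_sample()
    # Metropolis-Hastings ratio: target ratio times the proposal correction
    num = gamma(x_star) * q_density(x, x_star)
    den = gamma(x) * q_density(x_star, x)
    r = num / den if den > 0 else 0.0
    if random.random() < min(1.0, r):
        x = x_star
    samples.append(x)

mean = sum(samples) / len(samples)  # should approach the Beta(8, 4) mean, 2/3
```

Dropping the correction factor $$q(x_i|x^*)/q(x^*|x_i)$$ would bias the chain toward regions the proposal favors.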
• Proof sketch:
• Can be viewed as Metropolis on an augmented space
• Add an auxiliary variable, i.e. pretend we are sampling from a higher-dimensional distribution $$\tilde \pi(x, z) = \pi(x) q(z | x)$$ (note $$\pi(x)$$ is a marginal of $$\tilde \pi(x, z)$$)
• Consider the alternation of the two Metropolis kernels
1. $$q_1$$: sample $$z' \sim q(z | x)$$, keep $$x$$ unchanged
2. $$q_2$$: flip $$x$$ and $$z$$, i.e. $$x' = z$$, $$z' = x$$
• Exercises:
• Implementation:
• Write MH from scratch for the beta binomial problem (not in a PPL this time)
• Describe and implement a strategy to check that your implementation is correct
• Introduce a bug intentionally and make sure that your strategy above would have caught it
• Theory: use the steps below to complete the above argument to show MH admits $$\pi$$ as an invariant distribution (i.e. that its Markov kernel satisfies global balance with respect to $$\pi$$)
• Show the proposal $$q_2$$ is symmetric
• You can assume that the kernel corresponding to $$q_1$$ satisfies global balance (this is done with a Metropolis-within-Gibbs argument if you are curious)
• Show that these two steps combined lead to the Metropolis-Hastings ratio shown above