STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/20/2019
Goals
Today:
- Limitations of slice sampling and MH
- Parallel tempering
- HMC
Recap: modelling the randomness of MCMC
- Think of a very large population of people each with a laptop running MCMC
- There are 3 rooms corresponding to the 3 states \(A = \{0, 1, 2\}\)
- Depending on the state on their laptop, each person goes in the corresponding room
Recap: Occupancy vector and transition matrix
- Room occupancies can be organized into a vector. Normalize by number of people and get a distribution.
- If you start with occupancies \(v\) and let everyone run their MCMC for one step and do their room change, you get a new occupancy \(v'\)
- Exercise: convince yourself that \(v' = M v\), where column \(i\) of \(M\) holds the transition probabilities for an MCMC currently at state \(i\). I.e. \[M_{j,i} = {\mathbb{P}}(X_{t+1} = j | X_t = i)\]
Recap: Stationarity
- The distribution of interest is \(\pi = (1/3, 1/3, 1/3)\).
- Let us try to build \(M\) such that if we start with an occupancy given by the distribution of interest, \(v = \pi\), then after one step that occupancy is exactly preserved, \[v' = M \pi \Rightarrow v' = \pi\] (this is called global balance)
- For ‘idea/walk 2’ this is easy to see: between any pair of rooms \(i\) and \(j\), the number of people going from \(i\) to \(j\) is the same as the number going from \(j\) to \(i\) (this is an instance of detailed balance/reversibility)
- Exercise: show ‘idea/walk 1’ does not satisfy global balance whereas ‘idea/walk 2’ does
- If we have \[\pi = M \pi,\] this means \(\pi\) is a fixed point of applying \(M\).
- Then a general principle in maths is that if you repeatedly apply a map, under certain conditions you will converge to a fixed point of that map. I.e. say everyone starts at state 0: \(v_0 = (1, 0, 0)\), define \[v_n = M^n v_0\]
- Then \(v_n \to \pi\)
- The twist is that the fixed “point” is really a vector here, which we interpret as a distribution!
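To make the fixed-point idea concrete, here is a tiny numeric sketch; the transition matrix below is a hypothetical one chosen only so that the uniform distribution \(\pi = (1/3, 1/3, 1/3)\) is preserved (it is not one of the walks from class):

```python
import numpy as np

# Hypothetical 3-room transition matrix, stored so that M[i, j] = P(X_{t+1} = i | X_t = j);
# each column sums to 1, and M is symmetric, so the uniform pi satisfies global balance.
M = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

v = np.array([1.0, 0.0, 0.0])   # everyone starts in room 0
for n in range(20):
    v = M @ v                   # one MCMC step on every laptop, then a room change
print(v)                        # approximately (1/3, 1/3, 1/3)
```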

Recap: (finite) Markov chain theory
- The result we talked about in the last slide, \(v_n \to \pi\), is called “convergence of the marginals”
- It is not quite what we want.
- What we want is a Law of Large Numbers (LLN)
- This turns out to hold too if:
- We have a homogeneous Markov chain that satisfies global balance
- Some notion that we can explore the full state space from any starting point. E.g. in a discrete state space, we just need “irreducibility”, i.e. there is positive probability of going from any start state to any end state
- Having a Central Limit Theorem (CLT) is nice too:
- gives us a principled way to decide when to end the simulation
- more on this soon; the CLT is the theoretical foundation of the Effective Sample Size (ESS)
- for discrete Markov chains, the CLT holds under the same conditions as the LLN (the twist is the variance of the asymptotic normal distribution)
- It takes a bit more work to describe the conditions for continuous state spaces; we might talk more about Markov chain theory later if there is time/interest
Recap: slice sampling
Now let us see how JAGS proposes path changes, and then how to make the accept-reject decision
- JAGS will propose a change for only one variable at a time
- Question: look at the bivariate posterior density; if I fix one variable and change the other one, what do I get?
- Hint: this will be used to define a “slice”
- How the JAGS/slice sampler proposal works:
- Look at the current point \(x_i\),
- compute the maximum height \(= \gamma(x_i)\) (length of the black bold line)
- sample a height \(h\) uniformly in \([0, \gamma(x_i)]\) (i.e. sample a point on the black bold line)
- Propose to move to \(x^\star\), drawn uniformly in a window (green) around the current position, say \(x^\star\) uniform in \([x_i - 1, x_i + 1]\)
- These two steps define a proposed point \((x^\star, h)\). If this proposed point is still under the density (example shown as a green circle), accept; otherwise (example shown as a red circle) reject and stay where you were (see the code sketch below)

For more on slice sampling (in particular, how to avoid sensitivity to the size of the window): see the original slice sampling paper by R. Neal
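Here is a minimal sketch of the move just described, on a one-dimensional example; the bell-shaped \(\gamma\) and the window half-width are illustrative assumptions, not what JAGS uses internally:

```python
import numpy as np

rng = np.random.default_rng(1)

def gamma_fn(x):
    # illustrative unnormalized density (standard normal shape)
    return np.exp(-0.5 * x ** 2)

def slice_style_step(x, half_width=1.0):
    # sample a height uniformly between 0 and the current height gamma(x) ...
    h = rng.uniform(0.0, gamma_fn(x))
    # ... propose a new position uniformly in a window around the current point ...
    x_star = rng.uniform(x - half_width, x + half_width)
    # ... and accept only if the proposed point (x_star, h) is still under the density
    return x_star if h < gamma_fn(x_star) else x

x, samples = 0.0, []
for _ in range(10_000):
    x = slice_style_step(x)
    samples.append(x)
print(np.mean(samples), np.var(samples))   # roughly 0 and 1 for this target
```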
Recap: relation with Metropolis-Hastings (MH)
- The slice sampler proposes two things at each step:
- A new \(x\) position, \(x^*\)
- A height \(h\) uniformly from 0 to the max height
- Metropolis-Hastings: we can replace the uniform height by a coin flip
- How? Let’s consider two cases
Case 1: proposing to go “uphill”, \(\gamma(x^*) \ge \gamma(x)\). In this case, we always accept! See the picture:

Case 2: proposing to go “downhill”, \(\gamma(x^*) < \gamma(x)\). In this case, we may accept or reject… See the picture:

We can compute the probability that we accept! It’s the height of the green stick in bold divided by the height of the black stick in bold:
\[\frac{\gamma(x^*)}{\gamma(x_i)}\]
This quantity is called the MH ratio
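In code, the coin-flip version of the same move looks like this (again with an illustrative one-dimensional \(\gamma\) and a symmetric uniform window proposal):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma_fn = lambda x: np.exp(-0.5 * x ** 2)   # illustrative unnormalized density

def mh_step(x, half_width=1.0):
    # symmetric uniform proposal in a window around the current point
    x_star = rng.uniform(x - half_width, x + half_width)
    # coin flip with success probability min(1, MH ratio); uphill moves are always accepted
    return x_star if rng.uniform() < gamma_fn(x_star) / gamma_fn(x) else x
```

The only change relative to the slice-style sketch above is that the uniform height \(h\) has been replaced by a single comparison against the MH ratio.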
First failure mode: unidentifiability

- The two plots below show the first 100 iterations from JAGS
- Top one: exploring the unidentifiable posterior based on 10 observations (5 of which are failures)
- Bottom one: exploring the unidentifiable posterior based on 10000 observations (5000 of which are failures)
- As you can see, for this problem the posterior gets harder and harder to explore as the size of the dataset increases

Why is this landscape hard to explore for slice sampling?
Understanding the failure
- Unidentifiable model: a narrow diagonal tunnel is hard to explore if you are only allowed to move along the x or y axis

Note: other situations may cause this as well, but unidentifiability (or “weak identifiability”) is the most common one
Second failure mode: “multimodality”

- Real example: Bayesian clustering
- Details of the model soon
- To start with, let us compare the trace plot for JAGS/Stan vs. Blang

- Let us compare to a more powerful sampler: parallel tempering via Blang

Understanding the failure
- Multimodality: regions that are disconnected, or connected only through regions of low “height” (density), make exploration hard

Solutions to these two failure modes
- Hamiltonian Monte Carlo
- e.g. implemented in Stan; only works for continuous models
- will generally address the unidentifiability failure mode, not always the multimodal one
- Parallel Tempering
- implemented in Blang; works for continuous and discrete models
- solves both problems but may need many cores to do it quickly
Parallel tempering: annealing

Combination of two ideas: “annealing” and “parallel chains”
- First idea: annealing
- suppose a density has height \(h\) at a point
- consider \(h^{0.0001}\); note that this is \(\approx 1\) for both \(h=10\) and \(h=0.1\)
- taking a small power of a positive number brings it closer to \(1\)
- let’s do that with every point in the landscape we seek to explore!
\[\pi_{\beta}(x) = (\pi(x))^{\beta}\]
where \(\beta\) controls an interpolation between:
- the uniform distribution (\(\beta = 0\), easy to sample from)
- the posterior distribution (\(\beta = 1\), hard)
Let us call \(\beta\) the annealing parameter
Now we have several different “landscapes”, from smooth to rough. We will explore all of them simultaneously!
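A quick numeric check of this flattening effect, on a made-up bimodal density:

```python
import numpy as np

def pi_unnorm(x):
    # made-up unnormalized density with two well-separated modes
    return np.exp(-0.5 * (x - 5) ** 2) + np.exp(-0.5 * (x + 5) ** 2)

x = np.linspace(-8, 8, 5)                         # a few points across the landscape
for beta in [1.0, 0.1, 0.0001]:
    print(beta, np.round(pi_unnorm(x) ** beta, 3))
# as beta -> 0 every value approaches 1: the valleys between the modes fill in
```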
Parallel tempering: swaps
- Let’s pick a sequence of annealing parameters, say \((\beta_0, \dots, \beta_N) = (0, 1/N, 2/N, \dots, 1)\)
- Second idea: parallel tempering
- Run several MCMC chains in parallel (each based, say, on slice sampling), one for each landscape \(\pi_i(x) = (\pi(x))^{\beta_i}\), including the flat one (\(i=0\), where \(\beta = 0\)) and the landscape of interest (\(i=N\), where \(\beta = 1\))
- This ensemble of chains provides us with samples from the high dimensional landscape \(\pi(x_0, x_1, \dots, x_N) = \prod_i \pi_i(x_i)\)
- At the end, only keep the states in \(x_N\), i.e. those corresponding to \(\beta = 1\) (the posterior)
- But use the other ones to propose swaps between pairs of chains to allow “jumps” across the state space
- Swaps are accepted or rejected according to the Metropolis rule from earlier today (optional exercise: derive the acceptance probability, then compare with the sketch below)
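Here is a minimal sketch of the whole scheme on a made-up one-dimensional bimodal target; the local moves are the symmetric-window MH step from earlier, and the swap acceptance applies the Metropolis rule to the product target:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_pi(x):
    # made-up bimodal target, on the log scale to avoid underflow
    return np.logaddexp(-0.5 * (x - 5) ** 2, -0.5 * (x + 5) ** 2)

betas = np.linspace(0.0, 1.0, 6)   # annealing parameters (0, 1/N, ..., 1)
xs = np.zeros(len(betas))          # current state of each chain
keep = []                          # we only keep samples from the beta = 1 chain

def local_step(x, beta, half_width=2.0):
    # one symmetric-window MH move targeting pi^beta (any valid local sampler would do)
    x_star = rng.uniform(x - half_width, x + half_width)
    return x_star if np.log(rng.uniform()) < beta * (log_pi(x_star) - log_pi(x)) else x

for it in range(20_000):
    xs = np.array([local_step(x, b) for x, b in zip(xs, betas)])
    i = rng.integers(0, len(betas) - 1)          # pick a neighbouring pair of chains
    # Metropolis ratio for swapping x_i and x_{i+1} under prod_k pi^{beta_k}(x_k)
    log_ratio = (betas[i + 1] - betas[i]) * (log_pi(xs[i]) - log_pi(xs[i + 1]))
    if np.log(rng.uniform()) < log_ratio:
        xs[i], xs[i + 1] = xs[i + 1], xs[i]
    keep.append(xs[-1])

# without the swaps the beta = 1 chain essentially stays stuck in one mode;
# with them, both modes are visited and this fraction moves towards 1/2
print(np.mean(np.array(keep) > 0))
```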

Non-reversible parallel tempering
- You may need many parallel chains to be able to swap
- Naive parallel tempering then gets really slow
- Ongoing work in my research group: non-reversible parallel tempering avoids this issue
Parallel Tempering: Reversible (top) versus non-reversible (bottom)

Blang: a PPL implementation of non-reversible parallel tempering
- You write your model in a JAGS-like language
- Blang automatically creates a nice sequence of targets (a bit nicer than \(\pi^\beta\), which may not be a probability distribution for low \(\beta\))
- Multi-threaded: Blang runs several interacting chains in parallel
- When you do your capstone project, consider it if you have a complex Bayesian model or need model selection :)
- See the project website for more information
- Freedom respecting open source license
Discussion on last week example: change point problem / segmentation
- In a healthy cell, chromosomes (long strands of DNA) come in pairs (except for the sex chromosomes).
- In certain cancers, some parts of the chromosome lose one of the two copies.
- Suppose we segment one (non-sex) chromosome into segments \(0, 1, 2, \dots, 49\).
- For each segment, let \(S_i\) encode the number of copies in this segment.
- Examples of possible realizations for the copy numbers:

Change point problem
- Goal of inference: copy numbers, which are hard to observe directly.
- Instead, use sequencing data.
- DNA of many cells is pooled and cut into small pieces
- sequencing machines read some of these small pieces and map them into one of the 50 segments.
- Key point: if a segment has two copies, there should be on average twice as many reads mapped into it compared to a segment with only one copy.
- Let \(K\) be the average number of reads we get per segment and per chromosome.
- It could be determined experimentally or inferred; let us say we know \(K=3.5\).
Data looks like this:

Can you build a Bayesian model for that?
How to pick distributions?
- Use the type of the realization to narrow down the possibilities
- Example: for the Poisson emissions, we knew the support had to be \(\{0, 1, 2, 3, \dots\}\); this excludes Bernoulli and Uniform.
- Exploratory data analysis
- Ask experts/check literature if possible
- Theoretical grounds (asymptotic theory, mechanistic models, etc.)
- Example: “law of rare events” provides some interpretation
- Try several methods and use Bayesian model selection techniques (next week, if time permits)
Frequently used distributions
By type:
- Binary data types
- Bernoulli (by definition)
- Enumeration data types
- Categorical (by definition)
- Integers \(\ge 0\)
- Geometric (mode at zero)
- Poisson (mode close to mean)
- For more detailed models, Negative Binomial (2 parameters)
- Continuous \(\ge 0\):
- Exponential
- For more detailed models, Gamma
- Continuous in \([0, 1]\):
- Continuous Uniform
- For more detailed models, Beta
- Continuous on real line:
- Normal
By interpretation:
- Sum of Bernoullis:
- Binomial
- Population survey:
- Hypergeometric
- Waiting times:
- Geometric, Negative Binomial
Mathematical model
- Let \(S_0, S_1, \dots, S_{49}\) denote a list of random variables modelling the unobserved states.
- Each state takes two possible values (the “copy number state”).
- We want to encourage a priori the states to “stick” to the same value (i.e. the copy number does not change too often as we move along the genome), so we set \[P(S_i = s | S_{i-1} = s) = 0.95\]
- Exercise (part 1): how to write this in our mathematical notation, \(S_i | S_{i-1} \sim ???\) (hint: use Bernoullis and change the encoding of the states as 0 and 1 instead of 1 and 2)
- Exercise (part 2): write a JAGS model (hint: check “ifelse(x, a, b)” in the JAGS manual)
- Also have a look at the Blang model for comparison
- Let \(Y_0, Y_1, \dots, Y_{49}\) denote the observations (number of reads mapped for each of the 50 segments).
- We will use the following model (a common choice in bioinformatics): \[Y_i | S_i \sim \text{Poisson}(3.5 (S_i + 1))\]
- Recall: Poisson distribution
- Exercise (part 3): draw the graphical model (a forward-simulation sketch of the full model appears at the end of these notes)
- Recall \(K\) is the average number of reads we get per segment and per chromosome.
- How to proceed if \(K\) is not known?
- Part 4: Implement this idea in Blang as follows:
- In the variable declarations, make \(K\) random instead of param, and initialize it with ?: latentReal instead of fixedReal(3.5)
- Add a prior to \(K\) (see distributions reference)
- Graphical model of the new model?
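For reference, here is a minimal forward simulation of this generative model; the \(\{0, 1\}\) encoding of the copy-number states and the uniform initial state are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

K = 3.5              # average number of reads per segment and per chromosome copy
n_segments = 50
stick = 0.95         # P(S_i = s | S_{i-1} = s): the copy number rarely changes along the genome

# latent copy-number states, encoded as 0 (one copy) or 1 (two copies)
S = np.empty(n_segments, dtype=int)
S[0] = rng.integers(0, 2)                       # assumed uniform initial state
for i in range(1, n_segments):
    S[i] = S[i - 1] if rng.uniform() < stick else 1 - S[i - 1]

# observed read counts: Y_i | S_i ~ Poisson(K * (S_i + 1))
Y = rng.poisson(K * (S + 1))
print(S)
print(Y)
```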