STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/20/2019
Goals
Today:
- Limitations of slice sampling and MH
- Parallel tempering
- HMC
Recap: modelling the randomness of MCMC
- Think of a very large population of people each with a laptop running MCMC
- There are 3 rooms corresponding to the 3 states \(A = \{0, 1, 2\}\)
- Depending on the state on their laptop, each person goes in the corresponding room
Recap: Occupancy vector and transition matrix
- Room occupancies can be organized into a vector. Normalize by number of people and get a distribution.
- If you start with occupancies \(v\) and let everyone run their MCMC for one step and do their room change, you get a new occupancy \(v'\)
- Exercise: convince yourself that \(v' = M v\), where column \(i\) of \(M\) holds the transition probabilities for an MCMC currently at state \(i\). I.e. \[M_{j,i} = {\mathbb{P}}(X_{t+1} = j | X_t = i)\]
Recap: Stationarity
- The distribution of interest is \(\pi = (1/3, 1/3, 1/3)\).
- Let us try to build \(M\) such that if we start with an occupancy given by the distribution of interest, \(v = \pi\), then after one step that occupancy is exactly preserved, \[v' = M \pi \Rightarrow v' = \pi\] (this is called global balance)
- For ‘idea/walk 2’ this is easy to see: between any pair of rooms \(i\) and \(j\), the number of people going from \(i\) to \(j\) is the same as the number going from \(j\) to \(i\) (this is an instance of detailed balance/reversibility)
- Exercise: show ‘idea/walk 1’ does not satisfy global balance whereas ‘idea/walk 2’ does
- If we have \[\pi = M \pi,\] this means \(\pi\) is a fixed point of applying \(M\).
- Then a general principle in maths is that if you repeatedly apply a map, under certain conditions you will converge to a fixed point of that map. I.e. say everyone starts at state 0: \(v_0 = (1, 0, 0)\), define \[v_n = M^n v_0\]
- Then \(v_n \to \pi\)
- The twist is that the fixed “point” is really a vector here, which we interpret as a distribution!
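To make the fixed-point idea concrete, here is a tiny numeric sketch; the transition matrix below is a hypothetical one chosen only so that the uniform distribution \(\pi = (1/3, 1/3, 1/3)\) is preserved (it is not one of the walks from class):

```python
import numpy as np

# Hypothetical 3-room transition matrix, stored so that M[i, j] = P(X_{t+1} = i | X_t = j);
# each column sums to 1, and M is symmetric, so the uniform pi satisfies global balance.
M = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

v = np.array([1.0, 0.0, 0.0])   # everyone starts in room 0
for n in range(20):
    v = M @ v                   # one MCMC step on every laptop, then a room change
print(v)                        # approximately (1/3, 1/3, 1/3)
```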

Recap: (finite) Markov chain theory
- The result we talked about in the last slide, \(v_n \to \pi\), is called “convergence of the marginals”
- It is not quite what we want.
- What we want is a Law of Large Numbers (LLN)
- This turns out to hold too if:
- We have a homogeneous Markov chain that satisfies global balance
- Some notion that we can explore the full state space from any starting point. E.g. in a discrete state space, we just need “irreducibility”, i.e. there is positive probability of going from any start state to any end state
- Having a Central Limit Theorem (CLT) is nice too:
- gives us a principled way to decide when to end the simulation
- more on this soon; the CLT is the theoretical foundation of the Effective Sample Size (ESS)
- for discrete Markov chains, the CLT holds under the same conditions as the LLN (the twist is the variance of the asymptotic normal distribution)
- It takes a bit more work to describe the conditions for continuous state spaces; we might talk more about Markov chain theory later if there is time/interest
Recap: slice sampling
Now let us see how JAGS proposes path changes, and then how to make the accept-reject decision
- JAGS will propose a change for only one variable at a time
- Question: look at the bivariate posterior density; if I fix one variable and change the other one, what do I get?
- Hint: this will be used to define a “slice”
- How the JAGS/slice sampler proposal works:
- Look at the current point \(x_i\),
- compute the maximum height \(= \gamma(x_i)\) (length of the black bold line)
- sample a height \(h\) uniformly in \([0, \gamma(x_i)]\) (i.e. sample a point on the black bold line)
- Propose to move to \(x^\star\), drawn uniformly in a window (green) around the current position, say \(x^\star\) uniform in \([x_i - 1, x_i + 1]\)
- These two steps define a proposed point \((x^\star, h)\). If this proposed point is still under the density (example shown as a green circle), accept; otherwise (example shown as a red circle) reject and stay where you were (see the code sketch below)

For more on slice sampling (in particular, how to avoid sensitivity to the size of the window): see the original slice sampling paper by R. Neal
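Here is a minimal sketch of the move just described, on a one-dimensional example; the bell-shaped \(\gamma\) and the window half-width are illustrative assumptions, not what JAGS uses internally:

```python
import numpy as np

rng = np.random.default_rng(1)

def gamma_fn(x):
    # illustrative unnormalized density (standard normal shape)
    return np.exp(-0.5 * x ** 2)

def slice_style_step(x, half_width=1.0):
    # sample a height uniformly between 0 and the current height gamma(x) ...
    h = rng.uniform(0.0, gamma_fn(x))
    # ... propose a new position uniformly in a window around the current point ...
    x_star = rng.uniform(x - half_width, x + half_width)
    # ... and accept only if the proposed point (x_star, h) is still under the density
    return x_star if h < gamma_fn(x_star) else x

x, samples = 0.0, []
for _ in range(10_000):
    x = slice_style_step(x)
    samples.append(x)
print(np.mean(samples), np.var(samples))   # roughly 0 and 1 for this target
```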
Recap: relation with Metropolis-Hastings (MH)
- The slice sampler proposes two things at each step:
- A new \(x\) position, \(x^*\)
- A height \(h\) uniformly from 0 to the max height
- Metropolis-Hastings: we can replace the uniform height by a coin flip
- How? Let’s consider two cases
Case 1: proposing to go “uphill”, \(\gamma(x^*) \ge \gamma(x)\). In this case, we always accept! See the picture:

Case 2: proposing to go “downhill”, \(\gamma(x^*) < \gamma(x)\). In this case, we may accept or reject… See the picture:

We can compute the probability that we accept! It’s the height of the green stick in bold divided by the height of the black stick in bold:
\[\frac{\gamma(x^*)}{\gamma(x_i)}\]
This quantity is called the MH ratio
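In code, the coin-flip version of the same move looks like this (again with an illustrative one-dimensional \(\gamma\) and a symmetric uniform window proposal):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma_fn = lambda x: np.exp(-0.5 * x ** 2)   # illustrative unnormalized density

def mh_step(x, half_width=1.0):
    # symmetric uniform proposal in a window around the current point
    x_star = rng.uniform(x - half_width, x + half_width)
    # coin flip with success probability min(1, MH ratio); uphill moves are always accepted
    return x_star if rng.uniform() < gamma_fn(x_star) / gamma_fn(x) else x
```

The only change relative to the slice-style sketch above is that the uniform height \(h\) has been replaced by a single comparison against the MH ratio.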
First failure mode: unidentifiability

- The two plots below show the first 100 iterations from JAGS
- Top one: exploring the unidentifiable posterior based on 10 observations (5 of which are failures)
- Bottom one: exploring the unidentifiable posterior based on 10000 observations (5000 of which are failures)
- As you can see, for this problem the posterior gets harder and harder to explore as the size of the dataset increases

Why is this landscape hard to explore for slice sampling?
Understanding the failure
- Unidentifiable model: a narrow diagonal tunnel is hard to explore if you are only allowed to move along the x or y axis

Note: other situations may cause this as well, but unidentifiability (or “weak identifiability”) is the most common one
Second failure mode: “multimodality”

- Real example: Bayesian clustering
- Details of the model soon
- To start with, let us compare the trace plot for JAGS/Stan vs. Blang

- Let us compare to a more powerful sampler: parallel tempering via Blang

Understanding the failure
- Multimodality: regions that are disconnected, or connected only through regions of low “height” (density), make exploration hard

Solutions to these two failure modes
- Hamiltonian Monte Carlo
- e.g. implemented in Stan; only works for continuous models
- will generally address the unidentifiability failure mode, not always the multimodal one
- Parallel Tempering
- implemented in Blang; works for continuous and discrete models
- solves both problems but may need many cores to do it quickly
Parallel tempering: annealing

Combination of two ideas: “annealing” and “parallel chains”
- First idea: annealing
- suppose a density has height \(h\) at a point
- consider \(h^{0.0001}\); note that this is \(\approx 1\) for both \(h=10\) and \(h=0.1\)
- taking a small power of a positive number brings it closer to \(1\)
- let’s do that with every point in the landscape we seek to explore!
\[\pi_{\beta}(x) = (\pi(x))^{\beta}\]
where \(\beta\) controls an interpolation between:
- the uniform distribution (\(\beta = 0\), easy to sample from)
- the posterior distribution (\(\beta = 1\), hard)
Let us call \(\beta\) the annealing parameter
Now we have several different “landscapes”, from smooth to rough. We will explore all of them simultaneously!
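A quick numeric check of this flattening effect, on a made-up bimodal density:

```python
import numpy as np

def pi_unnorm(x):
    # made-up unnormalized density with two well-separated modes
    return np.exp(-0.5 * (x - 5) ** 2) + np.exp(-0.5 * (x + 5) ** 2)

x = np.linspace(-8, 8, 5)                         # a few points across the landscape
for beta in [1.0, 0.1, 0.0001]:
    print(beta, np.round(pi_unnorm(x) ** beta, 3))
# as beta -> 0 every value approaches 1: the valleys between the modes fill in
```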
Parallel tempering: swaps
- Let’s pick a sequence of annealing parameters, say \((\beta_0, \dots, \beta_N) = (0, 1/N, 2/N, \dots, 1)\)
- Second idea: parallel tempering
- Run several MCMC chains in parallel (each based, say, on slice sampling), one for each landscape \(\pi_i(x) = (\pi(x))^{\beta_i}\), including the flat one (\(i=0\), where \(\beta = 0\)) and the landscape of interest (\(i=N\), where \(\beta = 1\))
- This ensemble of chains provides us with samples from the high dimensional landscape \(\pi(x_0, x_1, \dots, x_N) = \prod_i \pi_i(x_i)\)
- At the end, only keep the states in \(x_N\), i.e. those corresponding to \(\beta = 1\) (the posterior)
- But use the other ones to propose swaps between pairs of chains to allow “jumps” across the state space
- Swaps are accepted or rejected according to the Metropolis rule from earlier today (optional exercise: derive the acceptance probability, then compare with the sketch below)
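Here is a minimal sketch of the whole scheme on a made-up one-dimensional bimodal target; the local moves are the symmetric-window MH step from earlier, and the swap acceptance applies the Metropolis rule to the product target:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_pi(x):
    # made-up bimodal target, on the log scale to avoid underflow
    return np.logaddexp(-0.5 * (x - 5) ** 2, -0.5 * (x + 5) ** 2)

betas = np.linspace(0.0, 1.0, 6)   # annealing parameters (0, 1/N, ..., 1)
xs = np.zeros(len(betas))          # current state of each chain
keep = []                          # we only keep samples from the beta = 1 chain

def local_step(x, beta, half_width=2.0):
    # one symmetric-window MH move targeting pi^beta (any valid local sampler would do)
    x_star = rng.uniform(x - half_width, x + half_width)
    return x_star if np.log(rng.uniform()) < beta * (log_pi(x_star) - log_pi(x)) else x

for it in range(20_000):
    xs = np.array([local_step(x, b) for x, b in zip(xs, betas)])
    i = rng.integers(0, len(betas) - 1)          # pick a neighbouring pair of chains
    # Metropolis ratio for swapping x_i and x_{i+1} under prod_k pi^{beta_k}(x_k)
    log_ratio = (betas[i + 1] - betas[i]) * (log_pi(xs[i]) - log_pi(xs[i + 1]))
    if np.log(rng.uniform()) < log_ratio:
        xs[i], xs[i + 1] = xs[i + 1], xs[i]
    keep.append(xs[-1])

# without the swaps the beta = 1 chain essentially stays stuck in one mode;
# with them, both modes are visited and this fraction moves towards 1/2
print(np.mean(np.array(keep) > 0))
```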

Non-reversible parallel tempering
- You may need many parallel chains to be able to swap
- Naive parallel tempering then gets really slow
- Ongoing work in my research group: non-reversible parallel tempering avoids this issue
Parallel Tempering: Reversible (top) versus non-reversible (bottom)

Blang: a PPL implementation of non-reversible parallel tempering
- You write your model in a JAGS-like language
- Blang automatically creates a nice sequence of targets (a bit nicer than \(\pi^\beta\), which may not be a probability distribution for low \(\beta\))
- Multi-threaded: Blang runs several interacting chains in parallel
- When you do your capstone project, consider it if you have a complex Bayesian model or need model selection :)
- See the project website for more information
- Freedom respecting open source license
Discussion on last week example: change point problem / segmentation
- In a healthy cell, chromosomes (long strands of DNA) come in pairs (except for the sex chromosomes).
- In certain cancers, some parts of the chromosome lose one of the two copies.
- Suppose we segment one (non-sex) chromosome into segments \(0, 1, 2, \dots, 49\).
- For each segment, let \(S_i\) encode the number of copies in this segment.
- Examples of possible realizations for the copy numbers:

Change point problem
- Goal of inference: copy numbers, which are hard to observe directly.
- Instead, use sequencing data.
- DNA of many cells is pooled and cut into small pieces
- sequencing machines read some of these small pieces and map them into one of the 50 segments.
- Key point: if a segment has two copies, there should be on average twice as many reads mapped into it compared to a segment with only one copy.
- Let \(K\) be the average number of reads we get per segment and per chromosome.
- It could be determined experimentally or inferred; let us say we know \(K=3.5\).
Data looks like this:

Can you build a Bayesian model for that?
How to pick distributions?
- Use the type of the realization to narrow down the possibilities
- Example: for the Poisson emissions, we knew the support had to be \(\{0, 1, 2, 3, \dots\}\); this excludes Bernoulli and Uniform.
- Exploratory data analysis
- Ask experts/check literature if possible
- Theoretical grounds (asymptotic theory, mechanistic models, etc.)
- Example: “law of rare events” provides some interpretation
- Try several methods and use Bayesian model selection techniques (next week, if time permits)
Frequently used distributions
By type:
- Binary data types
- Bernoulli (by definition)
- Enumeration data types
- Categorical (by definition)
- Integers \(\ge 0\)
- Geometric (mode at zero)
- Poisson (mode close to mean)
- For more detailed models, Negative Binomial (2 parameters)
- Continuous \(\ge 0\):
- Exponential
- For more detailed models, Gamma
- Continuous in \([0, 1]\):
- Continuous Uniform
- For more detailed models, Beta
- Continuous on real line:
- Normal
By interpretation:
- Sum of Bernoullis:
- Binomial
- Population survey:
- Hypergeometric
- Waiting times:
- Geometric, Negative Binomial
Mathematical model
- Let \(S_0, S_1, \dots, S_{49}\) denote a list of random variables modelling the unobserved states.
- Each state takes two possible values (the “copy number state”).
- We want to encourage a priori the states to “stick” to the same value (i.e. the copy number does not change too often as we move along the genome), so we set \[P(S_i = s | S_{i-1} = s) = 0.95\]
- Exercise (part 1): how to write this in our mathematical notation, \(S_i | S_{i-1} \sim ???\) (hint: use Bernoullis and change the encoding of the states as 0 and 1 instead of 1 and 2)
- Exercise (part 2): write a JAGS model (hint: check “ifelse(x, a, b)” in the JAGS manual)
- Also have a look at the Blang model for comparison
- Let \(Y_0, Y_1, \dots, Y_{49}\) denote the observations (number of reads mapped for each of the 50 segments).
- We will use the following model (a common choice in bioinformatics): \[Y_i | S_i \sim \text{Poisson}(3.5 (S_i + 1))\]
- Recall: Poisson distribution
- Exercise (part 3): draw the graphical model (a forward-simulation sketch of the full model appears at the end of these notes)
- Recall \(K\) is the average number of reads we get per segment and per chromosome.
- How to proceed if \(K\) is not known?
- Part 4: Implement this idea in Blang as follows:
- In the variable declarations, make \(K\) random instead of param, and initialize it with ?: latentReal instead of fixedReal(3.5)
- Add a prior to \(K\) (see distributions reference)
- Graphical model of the new model?
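For reference, here is a minimal forward simulation of this generative model; the \(\{0, 1\}\) encoding of the copy-number states and the uniform initial state are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

K = 3.5              # average number of reads per segment and per chromosome copy
n_segments = 50
stick = 0.95         # P(S_i = s | S_{i-1} = s): the copy number rarely changes along the genome

# latent copy-number states, encoded as 0 (one copy) or 1 (two copies)
S = np.empty(n_segments, dtype=int)
S[0] = rng.integers(0, 2)                       # assumed uniform initial state
for i in range(1, n_segments):
    S[i] = S[i - 1] if rng.uniform() < stick else 1 - S[i - 1]

# observed read counts: Y_i | S_i ~ Poisson(K * (S_i + 1))
Y = rng.poisson(K * (S + 1))
print(S)
print(Y)
```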