STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/18/2019
Goals
Today:
- Intro to Bayesian computational methods
- Motivation
- Slice sampling
- Metropolis-Hastings via slicing
- Limitations of slice sampling and MH
- Next lecture: alternatives (parallel tempering and HMC)
Recap: hierarchical model
In our case: just make \(\mu\) and \(s\) random! (or equivalently, \(\alpha\) and \(\beta\))
- Share these two “global parameters” across all launch types \[p_i | \mu, s \sim {\text{Beta}}(\mu s, (1 - \mu) s)\]
- We still need to put priors on \(\mu\) and \(s\)
- But you should expect the posterior to be less sensitive to these prior choices. Why? Hint: look at the number of outgoing edges in the graphical model. Compare with the number of outgoing edges for the “maiden flight” (see next slide)
- Example: \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\), \(s \sim \text{Exponential}(1/10000)\) (why such a small value? Hint: mean vs. rate parameter)
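For concreteness, here is a hedged JAGS sketch of these example choices (the data names nFails and nLaunches are illustrative assumptions, not necessarily the exact variables of the class model):
mu ~ dbeta(1, 1)        # Beta(1,1) = Uniform(0,1)
s ~ dexp(0.0001)        # dexp takes a rate; rate 1/10000 gives prior mean 10000
for (i in 1:length(nLaunches)) {
  p[i] ~ dbeta(mu * s, (1 - mu) * s)   # shared "global parameters" mu and s
  nFails[i] ~ dbin(p[i], nLaunches[i])
}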
Recap: sensitivity heuristic
It seems we have introduced new problems as now we again have hyperparameters, namely those for the priors on \(\mu\) and \(s\). Here we picked \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\), \(s \sim \text{Exponential}(1/10000)\)
Key point: yes, but now we are less sensitive to these choices!
Why? Heuristic: say you have a random variable connected to some hyper-parameters (grey squares) and random variables connected to data (circles)
- If most of the connections are to hyper-parameters: the variable will probably be relatively sensitive to them
- If there are many more connections to (data-connected) random variables than to hyper-parameters: it will probably be relatively insensitive
Before going hierarchical: for maiden/early flights we had
After going hierarchical:
JAGS under the hood
Key technique used by JAGS: slice sampling.
Why should we look under the hood? To understand the failure modes and how to address them.
Things that the slice sampler has problems with (failure modes):
- “Unidentifiable models”
- “Multimodal problems”
Solutions to these problems:
- Hamiltonian Monte Carlo
- e.g. implemented in Stan; only works for continuous models
- will only solve the first failure mode
- Parallel Tempering
- implemented in Blang; works for continuous and discrete models
- solves both problems but may need many cores to do it quickly
First failure mode: unidentifiability
Recall our unidentifiable model:
model Unidentifiable {
...
laws {
p ~ ContinuousUniform(0, 1)
p2 ~ ContinuousUniform(0, 1)
nFails | p, p2, nLaunches ~ Binomial(nLaunches, p * p2)
}
}
It gave us the following posterior:
MCMC: failure modes
Unidentifiability creates computational difficulties.
Example:
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
require(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
require(rjags)
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require(coda)
require(ggplot2)
modelstring="
model {
p1 ~ dunif(0, 1)
p2 ~ dunif(0, 1)
x ~ dbin(p1 * p2, n)
}
"
# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
plots <- list()
i <- 1
for (n in c(10,10000)) {local({
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
textConnection(modelstring),
data = list( # Pass on the data:
'n' = n,
'x' = n/2),
inits=inits)
samples <-
coda.samples(model,
c('p1', 'p2'), # These are the variables we want to monitor (plot, etc)
100) # number of MCMC iterations
chain <- samples[[1]]
p1 <- chain[,1]
p2 <- chain[,2]
df <- data.frame("p1" = as.numeric(p1), "p2" = as.numeric(p2))
p <- ggplot(df, aes(p1, p2)) +
xlim(0, 1) +
ylim(0, 1) +
geom_path()
plots[[i]] <<- p
})
i <- i + 1
}
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 2
## Total graph size: 9
##
## Initializing model
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 2
## Total graph size: 9
##
## Initializing model
multiplot(plotlist = plots)
## Loading required package: grid
- Both plots show the first 100 iterations from JAGS
- Top one: exploring the unidentifiable posterior based on 10 observations (5 of which are failures)
- Bottom one: exploring the unidentifiable posterior based on 10000 observations (5000 of which are failures)
- As you can see, for this problem the posterior gets harder and harder to explore as the size of the dataset increases
- Why? More on this soon after we go over slice sampling.
Second failure mode: “multimodality”
N <- 100000
components <- sample(1:3,prob=c(0.3,0.5,0.2),size=N,replace=TRUE)
mus <- c(0,10,3)
sds <- sqrt(c(1,1,0.1))
samples <- rnorm(n=N,mean=mus[components],sd=sds[components])
data <- data.frame(samples)
ggplot(data, aes(samples)) + geom_density() + theme_bw()
- Real example: Bayesian clustering
- Details of the model soon
- To start with, let us compare the trace plot for JAGS/Stan vs. Blang
require(rjags)
require(ggplot2)
require(coda)
require(ggmcmc) # Note: nice package converting JAGS output into tidy format + ggplot-based trace/posterior histograms/etc
## Loading required package: ggmcmc
modelstring="
model {
for (k in 1:2) {
mu[k] ~ dnorm(0, .0001)
sigma[k] ~ dunif(0, 100)
alpha[k] <- 1
}
pi ~ ddirch(alpha)
for (n in 1:length(observation)) {
z[n] ~ dcat(pi)
observation[n] ~ dnorm(mu[z[n]], sigma[z[n]]) # note: in JAGS, dnorm's second argument is a precision
}
}
"
data <- read.csv( file = "mixture_data_with_header.csv", header = TRUE)
# Make the simulations reproducible
my.seed <- 1 # 1 is arbitrary. The rest of the simulation is deterministic given that initial seed.
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
textConnection(modelstring),
data = data,
inits = inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 300
## Unobserved stochastic nodes: 305
## Total graph size: 1216
##
## Initializing model
samples <-
coda.samples(model,
c('mu'), # These are the variables we want to monitor (plot, compute expectations, etc)
10000) # number of MCMC iterations
tidy <- ggs(samples)
ggs_traceplot(tidy) + ylim(100,210) + theme_bw()
- Let us compare to a more powerful sampler: parallel tempering via Blang
Prerequisite to understanding slice sampling and MH
Question: how to design a sampler for a uniform distribution on a set \(A\), assuming only pointwise evaluation?
Recall: pointwise evaluation means all you can do is: “given \(x\), answer if \(x \in A\)”.
Example: \(A = \{0, 1, 2\}\)
Idea: use a random walk so that it scales to complicated/high dimensional \(A\)
Idea 1 (incorrect)
Move to one neighbor at random.
compute_stationary_distribution <- function(M) {
  # stationary distribution = eigenvector of t(M) with eigenvalue 1
  eig <- eigen(t(M))
  eigenvalues <- eig$values
  i <- which.min(abs(eigenvalues - 1)) # index of the eigenvalue closest to 1
  statio <- Re((eig$vectors)[, i])
  return(statio / sum(statio)) # normalize so the entries sum to one
}
M = matrix(c(
0, 1, 0,
1/2, 0, 1/2,
0, 1, 0
), nrow=3, ncol=3, byrow=T)
barplot(compute_stationary_distribution(M))
- This code will be explained soon.
- For now, if you are not convinced, simulate your own Markov chain and build a histogram; you will recover the above picture (see the sketch below).
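For instance, a minimal simulation sketch (states encoded as 0, 1, 2; not part of the original code):
set.seed(1)
n.iters <- 100000
x <- 0                       # start in state 0
visits <- integer(n.iters)
for (t in 1:n.iters) {
  # idea/walk 1: move to a neighbour uniformly at random
  x <- if (x == 1) sample(c(0, 2), 1) else 1
  visits[t] <- x
}
table(visits) / n.iters      # roughly (1/4, 1/2, 1/4): not uniform!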
Idea 2: propose + accept reject
- Say we are at position \(x_t \in A\) at iteration \(t\)
- Propose to go to left or right with equal probability. Get proposal \(x^*\)
- If \(x^* \in A\), set \(x_{t+1} = x^*\) (accept)
- Else, set \(x_{t+1} = x_t\) (reject)
M = matrix(c(
1/2, 1/2, 0,
1/2, 0, 1/2,
0, 1/2, 1/2
), nrow=3, ncol=3, byrow=T)
barplot(compute_stationary_distribution(M))
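A direct simulation of this propose + accept/reject walk (again a minimal sketch) confirms that the stationary distribution is now uniform on \(A\):
set.seed(1)
A <- c(0, 1, 2)
n.iters <- 100000
x <- 0
visits <- integer(n.iters)
for (t in 1:n.iters) {
  x.star <- x + sample(c(-1, 1), 1)   # propose left or right with equal probability
  if (x.star %in% A) x <- x.star      # accept if the proposal is in A, otherwise stay put (reject)
  visits[t] <- x
}
table(visits) / n.iters               # approximately (1/3, 1/3, 1/3)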
Understanding why idea/walk 1 fails while idea/walk 2 works
- Think of a very large population of people each with a laptop running MCMC
- There are 3 rooms corresponding to the 3 states \(A = \{0, 1, 2\}\)
- Depending on the state on their laptop, each person goes in the corresponding room
Occupancy vector and transition matrix
- Room occupancies can be organized into a vector. Normalize by number of people and get a distribution.
- If you start with occupancies \(v\) and let everyone run their MCMC for one step and do their room change, you get a new occupancy \(v'\)
- Exercise: convince yourself that \(v' = M v\), where column \(j\) of \(M\) contains the transition probabilities for an MCMC chain currently at state \(j\), i.e. \[M_{i,j} = {\mathbb{P}}(X_{t+1} = i | X_t = j)\]
Stationarity
- The distribution of interest is \(\pi = (1/3, 1/3, 1/3)\).
- Let us try to build \(M\) such that if we start with an occupancy given by the distribution of interest, \(v = \pi\), then after one step that occupancy is exactly preserved, \[v' = M \pi \Rightarrow v' = \pi\] (this is called global balance)
- For ‘idea/walk 2’ this is easy to see: between any pair of rooms \(i\) and \(j\), the number of people going from \(i\) to \(j\) is the same as the number going from \(j\) to \(i\) (this is an instance of detailed balance/reversibility)
- Exercise: show ‘idea/walk 1’ does not satisfy global balance whereas ‘idea/walk 2’ does
- If we have \[\pi = M \pi,\] this means \(\pi\) is a fixed point of applying \(M\).
- Then a general principle in mathematics is that, under certain conditions, repeatedly applying a map makes you converge to a fixed point of that map. E.g. say everyone starts in state 0, \(v_0 = (1, 0, 0)\), and define \[v_n = M^n v_0\] (see the numerical check after this list)
- Then \(v_n \to \pi\)
- The twist is that the fixed “point” is really a vector here, which we interpret as a distribution!
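A quick numerical check of this fixed-point iteration, using the ‘idea/walk 2’ matrix from the earlier code (recall those R matrices store the transition probabilities in their rows, hence the transpose):
M <- matrix(c(
  1/2, 1/2, 0,
  1/2, 0, 1/2,
  0, 1/2, 1/2
), nrow=3, ncol=3, byrow=T)
v <- c(1, 0, 0)                 # everyone starts in state 0
for (n in 1:100) {
  v <- as.numeric(t(M) %*% v)   # one step of the occupancy update
}
print(v)                        # close to (1/3, 1/3, 1/3)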
Note: the fixed point equation is the same as the equation defining eigenvectors and eigenvalues (the transpose in the code arises because the R matrices above store the transition probabilities in their rows). This explains how I implemented the function compute_stationary_distribution in the earlier R code:
print(compute_stationary_distribution)
## function(M) {
##   # stationary distribution = eigenvector of t(M) with eigenvalue 1
##   eig <- eigen(t(M))
##   eigenvalues <- eig$values
##   i <- which.min(abs(eigenvalues - 1)) # index of the eigenvalue closest to 1
##   statio <- Re((eig$vectors)[, i])
##   return(statio / sum(statio)) # normalize so the entries sum to one
## }
Quick overview of (finite) Markov chain theory
- The result we talked about in the last slide, \(v_n \to \pi\), is called “convergence of the marginals”
- It is not quite what we want.
- What we want is a Law of Large Numbers (LLN)
- This turns out to hold too if:
- We have a homogeneous Markov chain that satisfies global balance
- Some notion that we can explore the full state space from any starting point. E.g. in discrete state spaces, we just need “irreducibility”, i.e. there is a positive probability to go from any start state to any end state
- Having a Central Limit Theorem is nice to have too:
- gives us a principled way to decide when to end the simulation
- more on this soon; the CLT is the theoretical foundation of the Effective Sample Size (ESS) (see the example right after this list)
- for discrete Markov chains, the CLT holds under the same conditions as the LLN (the twist is in the variance of the asymptotic normal distribution)
- It takes a bit more work to describe the conditions for continuous state spaces; we might talk a bit more about Markov chain theory later if there is time/interest
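As a preview, coda can already report an ESS for the chains we ran earlier (a one-line sketch, reusing the samples object from the mixture example above):
effectiveSize(samples)  # one ESS estimate per monitored variable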
Understanding JAGS
- Let us start with something we know: approaching the Challenger problem with a decision tree
- Let’s say we have only 2 observations:
- First one: Temperature, 66; incident, 0
- Second one: Temperature, 70; incident, 1
Recall: JAGS code and graphical model
slope ~ dnorm(0, 0.001)
intercept ~ dnorm(0, 0.001)
for (i in 1:length(temperature)) {
p[i] <- ilogit(intercept + slope * temperature[i]) # recall: ilogit = logistic
F[i] ~ dbern(p[i])
}
Decision tree
Note: make sure you understand how the bivariate posterior density plot matches up with the decision tree (what is on the x axis? on the y axis?)
Computing the posterior
There are (at least) two ways to get to the posterior:
- Bayes rule
- MCMC/JAGS/other PPLs
Recall: Bayes rule
- Compute \(\gamma(x)\) for each path \(x\) consistent with the observations (\(\gamma\) is called the joint)
- Normalize to get the posterior: \(\pi(x) = \gamma(x) / Z\). Here \(Z\) involves integrals instead of sums.
We are able to compute \(\gamma(x)\) for any given path, but the problem is that computing \(Z\) is hard.
What JAGS/MCMC allows: computing the posterior when you only know \(\gamma\), not \(Z\).
Quick recap: computing \(\gamma\)
Let’s start with a review of how to compute \(\gamma(x)\) for a given path. I want to convince you that JAGS can do that automatically from the model you wrote.
- Precomputation: JAGS first uses a directed graphical model to decide on an ordering of the variables
- For each given path \(x\), JAGS then computes the chain rule using the precomputed ordering
Example on board:
- recall our data points are:
- First one: Temperature, 66; incident, 0
- Second one: Temperature, 70; incident, 1
- \(p[i]\) is given by
logistic(slope*temperature_i + intercept) = 1.0/(1.0 + exp(-slope*temp_i-intercept))
- let us say the path is
(0.05, -0.4, p_1=1.0/(1.0 + exp(-0.05*66+0.4)) = 0.9478464, F_1=0, p_2=...)
- what is the probability of this path?
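Here is a small R sketch of that chain-rule computation for the path above (assuming the priors from the JAGS code, where dnorm(0, 0.001) means precision 0.001, i.e. standard deviation sqrt(1/0.001)):
ilogit <- function(z) 1 / (1 + exp(-z))          # same function as JAGS's ilogit
temperature <- c(66, 70); F <- c(0, 1)           # the two observations
slope <- 0.05; intercept <- -0.4                 # the path's values for the two parameters
prior.sd <- sqrt(1 / 0.001)                      # JAGS dnorm(0, 0.001): precision 0.001
p <- ilogit(intercept + slope * temperature)     # deterministic given slope/intercept; p[1] = 0.9478464
gamma <- dnorm(slope, 0, prior.sd) *             # prior density of the slope
         dnorm(intercept, 0, prior.sd) *         # prior density of the intercept
         prod(dbinom(F, size = 1, prob = p))     # Bernoulli likelihoods of F[1] = 0 and F[2] = 1
gamma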
JAGS/slice sampling for computing the posterior when you only know \(\gamma\), not \(Z\)
Output of JAGS/MCMC: a list of samples \(S = (X_1, X_2, \dots, X_n)\) from which we can answer all the questions we care about.
Intuition: JAGS performs a random walk under the density
Algorithm: how JAGS computes \(S\):
- At each step, JAGS maintains one path in the decision tree, \(X_i\)
- JAGS “proposes” one change \(X^\star\) for one of the variables, keeping the others fixed
- It performs a test based on \(\gamma\) to check if that change should be kept (“accepted”, \(X_{i+1} \gets X^\star\)) or not, in which case we go back to the previous state, \(X_{i+1} \gets X_i\) (“rejected”).
- Add \(X_{i+1}\) to \(S\)
Note: because of “rejections” in the last step, \(S\) will actually contain duplicates; that is why I have started using the more precise terminology of a “list” of samples rather than a “set”
More details
Now let us see how JAGS proposes path changes, and then how to make the accept-reject decision
- JAGS will propose a change for only one variable at a time
- Question: look at the bivariate posterior density; if I fix one variable and change the other one, what do I get?
- Hint: this will be used to define a “slice”
- How JAGS/slice sampler proposal works:
- Look at current point \(x_i\),
- compute a maximum height \(= \gamma(x_i)\) (length of black bold line)
- sample height \(h\) uniformly in \([0, \gamma(x_i)]\) (i.e. sample a point in the black bold line)
- Propose to move \(x^\star\) uniformly in a window (green) around the current position, say \(x^\star\) uniform in \([x_i - 1, x_i + 1]\)
- These two steps define a proposed point \((x^\star, h)\). If this proposed point is still under the density (example shown as a green circle), accept; otherwise (example shown as a red circle) reject and stay where you were
For more on slice sampling (in particular, how to avoid sensitivity to the choice of window size): see the original slice sampling paper by R. Neal
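To make this concrete, here is a minimal R sketch of the accept/reject step just described, on a toy one-dimensional unnormalized density, with a fixed window of width 2 (the actual slice sampler in JAGS adapts the window, as discussed in Neal's paper):
set.seed(1)
gamma.fn <- function(x) ifelse(x > 0 & x < 1, x^2 * (1 - x)^3, 0) # unnormalized Beta(3,4) density
n.iters <- 10000
xs <- numeric(n.iters)
x <- 0.5
for (t in 1:n.iters) {
  h <- runif(1, 0, gamma.fn(x))          # sample a height uniformly under the current point
  x.star <- runif(1, x - 1, x + 1)       # propose uniformly in a window around x
  if (gamma.fn(x.star) >= h) x <- x.star # accept if (x.star, h) is still under the density
  # note: P(accept | x.star) = min(1, gamma.fn(x.star) / gamma.fn(x)), the MH ratio of the next slides
  xs[t] <- x
}
mean(xs)                                 # should approach 3/7, the mean of a Beta(3, 4)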
Relation with Metropolis-Hastings (MH)
- The slice sampler proposes two things at each step:
- A new \(x\) position, \(x^\star\)
- A height \(h\) uniformly from 0 to the max height
- Metropolis-Hastings: we can replace the uniform height by a coin flip
- How? Let’s consider two cases
Case 1: proposing to go “uphill”, \(\gamma(x^*) \ge \gamma(x)\). In this case, we always accept! See the picture:
Case 2: proposing to go “downhill”, \(\gamma(x^*) < \gamma(x)\). In this case, we may accept or reject… See the picture:
We can compute the probability that we accept! It’s the height of the green stick in bold divided by the height of the black stick in bold:
\[\frac{\gamma(x^*)}{\gamma(x_i)}\]
This quantity is called the MH ratio
First failure mode: unidentifiability
- The two plots below show the first 100 iterations from JAGS
- Top one: exploring the unidentifiable posterior based on 10 observations (5 of which are failures)
- Bottom one: exploring the unidentifiable posterior based on 10000 observations (5000 of which are failures)
- As you can see, for this problem the posterior gets harder and harder to explore as the size of the dataset increases
Why is this landscape hard to explore for slice sampling?
Understanding the failure
- Unidentifiable model: a narrow diagonal tunnel is hard to explore if you are only allowed to move along the x or y axis
Note: other situations may cause this as well, but unidentifiability (or “weak identifiability”) is the most common one
Second failure mode: “multimodality”
- Real example: Bayesian clustering
- Details of the model soon
- To start with, let us compare the trace plot for JAGS/Stan vs. Blang
- Let us compare to a more powerful sampler: parallel tempering via Blang
Understanding the failure
- Multimodality: regions that are disconnected, or connected only through a region of low “height” (density), make exploration hard
Solutions to these two failure modes
- Hamiltonian Monte Carlo
- e.g. implemented in Stan; only works for continuous models
- will generally address the unidentifiability failure mode, not always the multimodal one
- Parallel Tempering
- implemented in Blang; works for continuous and discrete models
- solves both problems but may need many cores to do it quickly
Discussion on last week example: change point problem / segmentation
- In a healthy cell, chromosomes (long strands of DNA) come in pairs (except for the sex chromosomes).
- In certain cancers, some parts of the chromosome come to lose one of the two copies.
- Suppose we segment one (non-sex) chromosome into segments \(0, 1, 2, \dots, 49\).
- For each segment, let \(S_i\) encode the number of copies in this segment.
- Examples of possible realizations for the copy numbers:
Change point problem
- Goal of inference: copy numbers, which are hard to observe directly.
- Instead, use sequencing data.
- DNA of many cells is pooled and cut into small pieces
- sequencing machines read some of these small pieces and map them to one of the 50 segments.
- Key point: if a segment has two copies, there should be on average twice as many reads mapped into it compared to a segment with only one copy.
- Let \(K\) be the average number of reads we get per segment and per chromosome.
- It could be determined experimentally or inferred; let us say we know \(K=3.5\).
Data looks like this:
Can you build a Bayesian model for that?
How to pick distributions?
- Use the type of the realization to narrow down the possibilities
- Example: for the Poisson emissions, we knew the support had to be \(\{0, 1, 2, 3, \dots\}\); this excludes Bernoulli and Uniform.
- Exploratory data analysis
- Ask experts/check literature if possible
- Theoretical grounds (asymptotic theory, mechanistic models, etc.)
- Example: “law of rare events” provides some interpretation
- Try several methods and use Bayesian model selection techniques (next week, if time permits)
Frequently used distributions
By type:
- Binary data types
- Bernoulli (by definition)
- Enumeration data types
- Categorical (by definition)
- Integers \(\ge 0\)
- Geometric (mode at zero)
- Poisson (mode close to mean)
- For more detailed models, Negative Binomial (2 parameters)
- Continuous \(\ge 0\):
- Exponential
- For more detailed models, Gamma
- Continuous in \([0, 1]\):
- Continuous Uniform
- For more detailed models, Beta
- Continuous on real line:
- Normal
By interpretation:
- Sum of Bernoullis:
- Binomial
- Population survey:
- HyperGeometric
- Waiting times:
- Geometric, Negative Binomial
Mathematical model
- Let \(S_0, S_1, \dots, S_{49}\) denote a list of random variables modelling the unobserved states.
- Each state takes one of two possible values (the “copy number state”).
- We want to encourage a priori the states to “stick” to the same value (i.e. the copy number does not change too often as we move along the genome), so we set \[P(S_i = s | S_{i-1} = s) = 0.95\]
- Exercise (part 1): how to write this in our mathematical notation, \(S_i | S_{i-1} \sim ???\) (hint: use Bernoullis and change the encoding of the states as 0 and 1 instead of 1 and 2)
- Exercise (part 2): write a JAGS model (hint: check "ifelse(x, a, b)" in the JAGS manual)
- Also have a look at the Blang model for comparison
- Let \(Y_0, Y_1, \dots, Y_{49}\) denote the observations (number of reads mapped for each of the 50 segments).
- We will use the following model (a common choice in bioinformatics): \[Y_i | S_i \sim \text{Poisson}(3.5 (S_i + 1))\]
- Recall: Poisson distribution
- Exercise (part 3): draw graphical model
- Recall \(K\) is the average number of reads we get per segment and per chromosome.
- How to proceed if \(K\) is not known?
- Part 4: Implement this idea in Blang as follows:
- In the variable declarations, make \(K\) random instead of param, and initialize it with ?: latentReal instead of fixedReal(3.5)
- Add a prior to \(K\) (see distributions reference)
- Graphical model of the new model?