STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/18/2019
Goals
Today:
- Intro to Bayesian computational methods
- Motivation
- Slice sampling
- Metropolis-Hastings via slicing
- Limitations of slice sampling and MH
- Next lecture: alternatives (parallel tempering and HMC)
Recap: hierarchical model
In our case: just make \(\mu\) and \(s\) random! (or equivalently, \(\alpha\) and \(\beta\))
- Share these two “global parameters” across all launch types \[p_i | \mu, s \sim {\text{Beta}}(\mu s, (1 - \mu) s)\]
- We still need to put priors on \(\mu\) and \(s\)
- But you should expect the posterior to be less sensitive to these prior choices. Why? Hint: look at the number of outgoing edges in the graphical model. Compare with the number of outgoing edges for the “maiden flight” (see next slide)
- Example: \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\), \(s \sim \text{Exponential}(1/10000)\) (why such a small value? Hint: mean vs. rate parameter)
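For concreteness, here is a hedged JAGS sketch of these example choices (the data names nFails and nLaunches are illustrative assumptions, not necessarily the exact variables of the class model):
mu ~ dbeta(1, 1)        # Beta(1,1) = Uniform(0,1)
s ~ dexp(0.0001)        # dexp takes a rate; rate 1/10000 gives prior mean 10000
for (i in 1:length(nLaunches)) {
  p[i] ~ dbeta(mu * s, (1 - mu) * s)   # shared "global parameters" mu and s
  nFails[i] ~ dbin(p[i], nLaunches[i])
}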
Recap: sensitivity heuristic
It seems we have introduced new problems as now we again have hyperparameters, namely those for the priors on \(\mu\) and \(s\). Here we picked \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\), \(s \sim \text{Exponential}(1/10000)\)
Key point: yes, but now we are less sensitive to these choices!
Why? Heuristic: say you have a random variable connected to some hyper-parameters (grey squares) and random variables connected to data (circles)
- If most of the connections are to hyper-parameters: the variable will probably be relatively sensitive to them
- If there are many more connections to (data-connected) random variables than to hyper-parameters: it will probably be relatively insensitive
Before going hierarchical: for maiden/early flights we had
After going hierarchical:
JAGS under the hood
Key technique used by JAGS: slice sampling.
Why should we look under the hood? To understand the failure modes and how to address them.
Things that the slice sampler has problems with (failure modes):
- “Unidentifiable models”
- “Multimodal problems”
Solutions to these problems:
- Hamiltonian Monte Carlo
- e.g. implemented in Stan; only works for continuous models
- will only solve the first failure mode
- Parallel Tempering
- implemented in Blang; works for continuous and discrete models
- solves both problems but may need many cores to do it quickly
First failure mode: unidentifiability
Recall our unidentifiable model:
model Unidentifiable {
...
laws {
p ~ ContinuousUniform(0, 1)
p2 ~ ContinuousUniform(0, 1)
nFails | p, p2, nLaunches ~ Binomial(nLaunches, p * p2)
}
}
It gave us the following posterior:
MCMC: failure modes
Unidentifiability creates computational difficulties.
Example:
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
require(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
require(rjags)
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require(coda)
require(ggplot2)
modelstring="
model {
p1 ~ dunif(0, 1)
p2 ~ dunif(0, 1)
x ~ dbin(p1 * p2, n)
}
"
# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
plots <- list()
i <- 1
for (n in c(10,10000)) {local({
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
textConnection(modelstring),
data = list( # Pass on the data:
'n' = n,
'x' = n/2),
inits=inits)
samples <-
coda.samples(model,
c('p1', 'p2'), # These are the variables we want to monitor (plot, etc)
100) # number of MCMC iterations
chain <- samples[[1]]
p1 <- chain[,1]
p2 <- chain[,2]
df <- data.frame("p1" = as.numeric(p1), "p2" = as.numeric(p2))
p <- ggplot(df, aes(p1, p2)) +
xlim(0, 1) +
ylim(0, 1) +
geom_path()
plots[[i]] <<- p
})
i <- i + 1
}
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 2
## Total graph size: 9
##
## Initializing model
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 2
## Total graph size: 9
##
## Initializing model
multiplot(plotlist = plots)
## Loading required package: grid
- Both plots show the first 100 iterations from JAGS
- Top one: exploring the unidentifiable posterior based on 10 observations (5 of which are failures)
- Bottom one: exploring the unidentifiable posterior based on 10000 observations (5000 of which are failures)
- As you can see, for this problem the posterior gets harder and harder to explore as the size of the dataset increases
- Why? More on this soon after we go over slice sampling.
Second failure mode: “multimodality”
N <- 100000
components <- sample(1:3,prob=c(0.3,0.5,0.2),size=N,replace=TRUE)
mus <- c(0,10,3)
sds <- sqrt(c(1,1,0.1))
samples <- rnorm(n=N,mean=mus[components],sd=sds[components])
data <- data.frame(samples)
ggplot(data, aes(samples)) + geom_density() + theme_bw()
- Real example: Bayesian clustering
- Details of the model soon
- To start with, let us compare the trace plot for JAGS/Stan vs. Blang
require(rjags)
require(ggplot2)
require(coda)
require(ggmcmc) # Note: nice package converting JAGS output into tidy format + ggplot-based trace/posterior histograms/etc
## Loading required package: ggmcmc
modelstring="
model {
for (k in 1:2) {
mu[k] ~ dnorm(0, .0001)
sigma[k] ~ dunif(0, 100)
alpha[k] <- 1
}
pi ~ ddirch(alpha)
for (n in 1:length(observation)) {
z[n] ~ dcat(pi)
observation[n] ~ dnorm(mu[z[n]], sigma[z[n]]) # note: in JAGS, dnorm's second argument is a precision
}
}
"
data <- read.csv( file = "mixture_data_with_header.csv", header = TRUE)
# Make the simulations reproducible
my.seed <- 1 # 1 is arbitrary. The rest of the simulation is deterministic given that initial seed.
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
textConnection(modelstring),
data = data,
inits = inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 300
## Unobserved stochastic nodes: 305
## Total graph size: 1216
##
## Initializing model
samples <-
coda.samples(model,
c('mu'), # These are the variables we want to monitor (plot, compute expectations, etc)
10000) # number of MCMC iterations
tidy <- ggs(samples)
ggs_traceplot(tidy) + ylim(100,210) + theme_bw()
- Let us compare to a more powerful sampler: parallel tempering via Blang
Prerequisite to understanding slice sampling and MH
Question: how to design a sampler for a uniform distribution on a set \(A\), assuming only pointwise evaluation?
Recall: pointwise evaluation means all you can do is: “given \(x\), answer if \(x \in A\)”.
Example: \(A = \{0, 1, 2\}\)
Idea: use a random walk so that it scales to complicated/high dimensional \(A\)
Idea 1 (incorrect)
Move to one neighbor at random.
compute_stationary_distribution <- function(M) {
  # stationary distribution = eigenvector of t(M) with eigenvalue 1
  eig <- eigen(t(M))
  eigenvalues <- eig$values
  i <- which.min(abs(eigenvalues - 1)) # index of the eigenvalue closest to 1
  statio <- Re((eig$vectors)[, i])
  return(statio / sum(statio)) # normalize so the entries sum to one
}
M = matrix(c(
0, 1, 0,
1/2, 0, 1/2,
0, 1, 0
), nrow=3, ncol=3, byrow=T)
barplot(compute_stationary_distribution(M))
- This code will be explained soon.
- For now, if you are not convinced, simulate your own Markov chain and build a histogram; you will recover the above picture (see the sketch below).
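For instance, a minimal simulation sketch (states encoded as 0, 1, 2; not part of the original code):
set.seed(1)
n.iters <- 100000
x <- 0                       # start in state 0
visits <- integer(n.iters)
for (t in 1:n.iters) {
  # idea/walk 1: move to a neighbour uniformly at random
  x <- if (x == 1) sample(c(0, 2), 1) else 1
  visits[t] <- x
}
table(visits) / n.iters      # roughly (1/4, 1/2, 1/4): not uniform!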
Idea 2: propose + accept reject
- Say we are at position \(x_t \in A\) at iteration \(t\)
- Propose to go to left or right with equal probability. Get proposal \(x^*\)
- If \(x^* \in A\), set \(x_{t+1} = x^*\) (accept)
- Else, set \(x_{t+1} = x_t\) (reject)
M = matrix(c(
1/2, 1/2, 0,
1/2, 0, 1/2,
0, 1/2, 1/2
), nrow=3, ncol=3, byrow=T)
barplot(compute_stationary_distribution(M))
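A direct simulation of this propose + accept/reject walk (again a minimal sketch) confirms that the stationary distribution is now uniform on \(A\):
set.seed(1)
A <- c(0, 1, 2)
n.iters <- 100000
x <- 0
visits <- integer(n.iters)
for (t in 1:n.iters) {
  x.star <- x + sample(c(-1, 1), 1)   # propose left or right with equal probability
  if (x.star %in% A) x <- x.star      # accept if the proposal is in A, otherwise stay put (reject)
  visits[t] <- x
}
table(visits) / n.iters               # approximately (1/3, 1/3, 1/3)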
Understanding why idea/walk 1 fails while idea/walk 2 works
- Think of a very large population of people each with a laptop running MCMC
- There are 3 rooms corresponding to the 3 states \(A = \{0, 1, 2\}\)
- Depending on the state on their laptop, each person goes in the corresponding room
Occupancy vector and transition matrix
- Room occupancies can be organized into a vector. Normalize by number of people and get a distribution.
- If you start with occupancies \(v\) and let everyone run their MCMC for one step and do their room change, you get a new occupancy \(v'\)
- Exercise: convince yourself that \(v' = M v\), where column \(j\) of \(M\) contains the transition probabilities for an MCMC chain currently at state \(j\), i.e. \[M_{i,j} = {\mathbb{P}}(X_{t+1} = i | X_t = j)\]
Stationarity
- The distribution of interest is \(\pi = (1/3, 1/3, 1/3)\).
- Let us try to build \(M\) such that if we start with an occupancy given by the distribution of interest, \(v = \pi\), then after one step that occupancy is exactly preserved, \[v' = M \pi \Rightarrow v' = \pi\] (this is called global balance)
- For ‘idea/walk 2’ this is easy to see: between any pair of rooms \(i\) and \(j\), the number of people going from \(i\) to \(j\) is the same as the number going from \(j\) to \(i\) (this is an instance of detailed balance/reversibility)
- Exercise: show ‘idea/walk 1’ does not satisfy global balance whereas ‘idea/walk 2’ does
- If we have \[\pi = M \pi,\] this means \(\pi\) is a fixed point of applying \(M\).
- Then a general principle in mathematics is that, under certain conditions, repeatedly applying a map makes you converge to a fixed point of that map. E.g. say everyone starts in state 0, \(v_0 = (1, 0, 0)\), and define \[v_n = M^n v_0\] (see the numerical check after this list)
- Then \(v_n \to \pi\)
- The twist is that the fixed “point” is really a vector here, which we interpret as a distribution!
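A quick numerical check of this fixed-point iteration, using the ‘idea/walk 2’ matrix from the earlier code (recall those R matrices store the transition probabilities in their rows, hence the transpose):
M <- matrix(c(
  1/2, 1/2, 0,
  1/2, 0, 1/2,
  0, 1/2, 1/2
), nrow=3, ncol=3, byrow=T)
v <- c(1, 0, 0)                 # everyone starts in state 0
for (n in 1:100) {
  v <- as.numeric(t(M) %*% v)   # one step of the occupancy update
}
print(v)                        # close to (1/3, 1/3, 1/3)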
Note: the fixed point equation is the same as the equation defining eigenvectors and eigenvalues (the transpose in the code arises because the R matrices above store the transition probabilities in their rows). This explains how I implemented the function compute_stationary_distribution in the earlier R code:
print(compute_stationary_distribution)
## function(M) {
##   # stationary distribution = eigenvector of t(M) with eigenvalue 1
##   eig <- eigen(t(M))
##   eigenvalues <- eig$values
##   i <- which.min(abs(eigenvalues - 1)) # index of the eigenvalue closest to 1
##   statio <- Re((eig$vectors)[, i])
##   return(statio / sum(statio)) # normalize so the entries sum to one
## }
Quick overview of (finite) Markov chain theory
- The result we talked about in the last slide, \(v_n \to \pi\), is called “convergence of the marginals”
- It is not quite what we want.
- What we want is a Law of Large Numbers (LLN)
- This turns out to hold too if:
- We have a homogeneous Markov chain that satisfies global balance
- Some notion that we can explore the full state space from any starting point. E.g. in discrete state spaces, we just need “irreducibility”, i.e. there is a positive probability to go from any start state to any end state
- Having a Central Limit Theorem is nice to have too:
- gives us a principled way to decide when to end the simulation
- more on this soon; the CLT is the theoretical foundation of the Effective Sample Size (ESS) (see the example right after this list)
- for discrete Markov chains, the CLT holds under the same conditions as the LLN (the twist is in the variance of the asymptotic normal distribution)
- It takes a bit more work to describe the conditions for continuous state spaces; we might talk a bit more about Markov chain theory later if there is time/interest
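As a preview, coda can already report an ESS for the chains we ran earlier (a one-line sketch, reusing the samples object from the mixture example above):
effectiveSize(samples)  # one ESS estimate per monitored variable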
Understanding JAGS
- Let us start with something we know: approaching the Challenger problem with a decision tree
- Let’s say we have only 2 observations:
- First one: Temperature, 66; incident, 0
- Second one: Temperature, 70; incident, 1
Recall: JAGS code and graphical model
slope ~ dnorm(0, 0.001)
intercept ~ dnorm(0, 0.001)
for (i in 1:length(temperature)) {
p[i] <- ilogit(intercept + slope * temperature[i]) # recall: ilogit = logistic
F[i] ~ dbern(p[i])
}
Decision tree
Note: make sure you understand how the bivariate posterior density plot matches up with the decision tree (what is on the x axis? on the y axis?)
Computing the posterior
There are (at least) two ways to get to the posterior:
- Bayes rule
- MCMC/JAGS/other PPLs
Recall: Bayes rule
- Compute \(\gamma(x)\) for each path \(x\) consistent with the observations (\(\gamma\) is called the joint)
- Normalize to get the posterior: \(\pi(x) = \gamma(x) / Z\). Here \(Z\) involves integrals instead of sums.
We are able to compute \(\gamma(x)\) for any given path, but the problem is that computing \(Z\) is hard.
What JAGS/MCMC allows: computing the posterior when you only know \(\gamma\), not \(Z\).
Quick recap: computing \(\gamma\)
Let’s start with a review of how to compute \(\gamma(x)\) for a given path. I want to convince you that JAGS can do that automatically from the model you wrote.
- Precomputation: JAGS first uses a directed graphical model to decide on an ordering of the variables
- For each given path \(x\), JAGS then computes the chain rule using the precomputed ordering
Example on board:
- recall our data points are:
- First one: Temperature, 66; incident, 0
- Second one: Temperature, 70; incident, 1
- \(p[i]\) is given by
logistic(slope*temperature_i + intercept) = 1.0/(1.0 + exp(-slope*temp_i-intercept))
- let us say the path is
(0.05, -0.4, p_1=1.0/(1.0 + exp(-0.05*66+0.4)) = 0.9478464, F_1=0, p_2=...)
- what is the probability of this path?
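Here is a small R sketch of that chain-rule computation for the path above (assuming the priors from the JAGS code, where dnorm(0, 0.001) means precision 0.001, i.e. standard deviation sqrt(1/0.001)):
ilogit <- function(z) 1 / (1 + exp(-z))          # same function as JAGS's ilogit
temperature <- c(66, 70); F <- c(0, 1)           # the two observations
slope <- 0.05; intercept <- -0.4                 # the path's values for the two parameters
prior.sd <- sqrt(1 / 0.001)                      # JAGS dnorm(0, 0.001): precision 0.001
p <- ilogit(intercept + slope * temperature)     # deterministic given slope/intercept; p[1] = 0.9478464
gamma <- dnorm(slope, 0, prior.sd) *             # prior density of the slope
         dnorm(intercept, 0, prior.sd) *         # prior density of the intercept
         prod(dbinom(F, size = 1, prob = p))     # Bernoulli likelihoods of F[1] = 0 and F[2] = 1
gamma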
JAGS/slice sampling for computing the posterior when you only know \(\gamma\), not \(Z\)
Output of JAGS/MCMC: a list of samples \(S = (X_1, X_2, \dots, X_n)\) from which we can answer all the questions we care about.
Intuition: JAGS performs a random walk under the density
Algorithm: how JAGS computes \(S\):
- At each step, JAGS maintains one path in the decision tree, \(X_i\)
- JAGS “proposes” one change \(X^\star\) for one of the variables, keeping the others fixed
- It performs a test based on \(\gamma\) to check if that change should be kept (“accepted”, \(X_{i+1} \gets X^\star\)) or not, in which case we go back to the previous state, \(X_{i+1} \gets X_i\) (“rejected”).
- Add \(X_{i+1}\) to \(S\)
Note: because of “rejections” in the last step, \(S\) will actually contain duplicates; that is why I have started using the more precise terminology of a “list” of samples rather than a “set”
More details
Now let us see how JAGS proposes path changes, and then how to make the accept-reject decision
- JAGS will propose a change for only one variable at a time
- Question: look at the bivariate posterior density; if I fix one variable and change the other one, what do I get?
- Hint: this will be used to define a “slice”
- How JAGS/slice sampler proposal works:
- Look at current point \(x_i\),
- compute a maximum height \(= \gamma(x_i)\) (length of black bold line)
- sample height \(h\) uniformly in \([0, \gamma(x_i)]\) (i.e. sample a point in the black bold line)
- Propose to move \(x^\star\) uniformly in a window (green) around the current position, say \(x^\star\) uniform in \([x_i - 1, x_i + 1]\)
- These two steps define a proposed point \((x^\star, h)\). If this proposed point is still under the density (example shown as a green circle), accept; otherwise (example shown as a red circle) reject and stay where you were
For more on slice sampling (in particular, how to avoid sensitivity to the choice of window size): see the original slice sampling paper by R. Neal
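To make this concrete, here is a minimal R sketch of the accept/reject step just described, on a toy one-dimensional unnormalized density, with a fixed window of width 2 (the actual slice sampler in JAGS adapts the window, as discussed in Neal's paper):
set.seed(1)
gamma.fn <- function(x) ifelse(x > 0 & x < 1, x^2 * (1 - x)^3, 0) # unnormalized Beta(3,4) density
n.iters <- 10000
xs <- numeric(n.iters)
x <- 0.5
for (t in 1:n.iters) {
  h <- runif(1, 0, gamma.fn(x))          # sample a height uniformly under the current point
  x.star <- runif(1, x - 1, x + 1)       # propose uniformly in a window around x
  if (gamma.fn(x.star) >= h) x <- x.star # accept if (x.star, h) is still under the density
  # note: P(accept | x.star) = min(1, gamma.fn(x.star) / gamma.fn(x)), the MH ratio of the next slides
  xs[t] <- x
}
mean(xs)                                 # should approach 3/7, the mean of a Beta(3, 4)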
Relation with Metropolis-Hastings (MH)
- The slice sampler proposes two things at each step:
- A new \(x\) position, \(x^\star\)
- A height \(h\) uniformly from 0 to the max height
- Metropolis-Hastings: we can replace the uniform height by a coin flip
- How? Let’s consider two cases
Case 1: proposing to go “uphill”, \(\gamma(x^*) \ge \gamma(x)\). In this case, we always accept! See the picture:
Case 2: proposing to go “downhill”, \(\gamma(x^*) < \gamma(x)\). In this case, we may accept or reject… See the picture:
We can compute the probability that we accept! It’s the height of the green stick in bold divided by the height of the black stick in bold:
\[\frac{\gamma(x^*)}{\gamma(x_i)}\]
This quantity is called the MH ratio
First failure mode: unidentifiability
- The two plots below show the first 100 iterations from JAGS
- Top one: exploring the unidentifiable posterior based on 10 observations (5 of which are failures)
- Bottom one: exploring the unidentifiable posterior based on 10000 observations (5000 of which are failures)
- As you can see, for this problem the posterior gets harder and harder to explore as the size of the dataset increases
Why is this landscape hard to explore for slice sampling?
Understanding the failure
- Unidentifiable model: a narrow diagonal tunnel is hard to explore if you are only allowed to move along the x or y axis
Note: other situations may cause this as well, but unidentifiability (or “weak identifiability”) is the most common one
Second failure mode: “multimodality”
- Real example: Bayesian clustering
- Details of the model soon
- To start with, let us compare the trace plot for JAGS/Stan vs. Blang
- Let us compare to a more powerful sampler: parallel tempering via Blang
Understanding the failure
- Multimodality: regions that are disconnected, or connected only through a region of low “height” (density), make exploration hard
Solutions to these two failure modes
- Hamiltonian Monte Carlo
- e.g. implemented in Stan; only works for continuous models
- will generally address the unidentifiability failure mode, not always the multimodal one
- Parallel Tempering
- implemented in Blang; works for continuous and discrete models
- solves both problems but may need many cores to do it quickly
Discussion on last week example: change point problem / segmentation
- In a healthy cell, chromosomes (long strands of DNA) come in pairs (except for the sex chromosomes).
- In certain cancers, some parts of the chromosome come to lose one of the two copies.
- Suppose we segment one (non-sex) chromosome into segments \(0, 1, 2, \dots, 49\).
- For each segment, let \(S_i\) encode the number of copies in this segment.
- Examples of possible realizations for the copy numbers:
Change point problem
- Goal of inference: copy numbers, which are hard to observe directly.
- Instead, use sequencing data.
- DNA of many cells is pooled and cut into small pieces
- sequencing machines read some of these small pieces and map them to one of the 50 segments.
- Key point: if a segment has two copies, there should be on average twice as many reads mapped into it compared to a segment with only one copy.
- Let \(K\) be the average number of reads we get per segment and per chromosome.
- It could be determined experimentally or inferred; let us say we know \(K=3.5\).
Data looks like this:
Can you build a Bayesian model for that?
How to pick distributions?
- Use the type of the realization to narrow down the possibilities
- Example: for the Poisson emissions, we knew the support had to be \(\{0, 1, 2, 3, \dots\}\); this excludes Bernoulli and Uniform.
- Exploratory data analysis
- Ask experts/check literature if possible
- Theoretical grounds (asymptotic theory, mechanistic models, etc.)
- Example: “law of rare events” provides some interpretation
- Try several methods and use Bayesian model selection techniques (next week, if time permits)
Frequently used distributions
By type:
- Binary data types
- Bernoulli (by definition)
- Enumeration data types
- Categorical (by definition)
- Integers \(\ge 0\)
- Geometric (mode at zero)
- Poisson (mode close to mean)
- For more detailed models, Negative Binomial (2 parameters)
- Continuous \(\ge 0\):
- Exponential
- For more detailed models, Gamma
- Continuous in \([0, 1]\):
- Continuous Uniform
- For more detailed models, Beta
- Continuous on real line:
- Normal
By interpretation:
- Sum of Bernoullis:
- Binomial
- Population survey:
- HyperGeometric
- Waiting times:
- Geometric, Negative Binomial
Mathematical model
- Let \(S_0, S_1, \dots, S_{49}\) denote a list of random variables modelling the unobserved states.
- Each state takes one of two possible values (the “copy number state”).
- We want to encourage a priori the states to “stick” to the same value (i.e. the copy number does not change too often as we move along the genome), so we set \[P(S_i = s | S_{i-1} = s) = 0.95\]
- Exercise (part 1): how to write this in our mathematical notation, \(S_i | S_{i-1} \sim ???\) (hint: use Bernoullis and change the encoding of the states as 0 and 1 instead of 1 and 2)
- Exercise (part 2): write a JAGS model (hint: check "ifelse(x, a, b)" in the JAGS manual)
- Also have a look at the Blang model for comparison
- Let \(Y_0, Y_1, \dots, Y_{49}\) denote the observations (number of reads mapped for each of the 50 segments).
- We will use the following model (a common choice in bioinformatics): \[Y_i | S_i \sim \text{Poisson}(3.5 (S_i + 1))\]
- Recall: Poisson distribution
- Exercise (part 3): draw graphical model
- Recall \(K\) is the average number of reads we get per segment and per chromosome.
- How to proceed if \(K\) is not known?
- Part 4: Implement this idea in Blang as follows:
- In the variable declarations, make \(K\) random instead of param, and initialize it with ?: latentReal instead of fixedReal(3.5)
- Add a prior to \(K\) (see distributions reference)
- Graphical model of the new model?