STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

3/18/2019

Goals

Today:

Logistics

Updated schedule with some readings

Recall: Exercise 2 due tonight!

Recap: hierarchical model

drawing

In our case: just make \(\mu\) and \(s\) random! (or equivalently, \(\alpha\) and \(\beta\))
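A minimal JAGS sketch of this hierarchical construction (variable names like nFails and nLaunches are illustrative; \(\alpha = \mu s\) and \(\beta = (1-\mu)s\) is the mean/concentration reparameterization of the Beta, with the hyperpriors recalled in the next section):

mu ~ dbeta(1, 1)    # uniform hyperprior on the mean of the p[i]'s
s ~ dexp(1/10000)   # hyperprior on the concentration
for (i in 1:length(nLaunches)) {
  p[i] ~ dbeta(mu * s, (1 - mu) * s)   # one failure probability per group
  nFails[i] ~ dbin(p[i], nLaunches[i])
}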

Recap: sensitivity heuristic

It seems we have introduced new problems: we again have hyperparameters, namely those of the priors on \(\mu\) and \(s\). Here we picked \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\) and \(s \sim \text{Exponential}(1/10000)\).

Key point: yes, but now we are less sensitive to these choices!

Why? Heuristic: say you have a random variable connected to some hyperparameters (grey squares) and random variables connected to data (circles).

Before going hierarchical: for maiden/early flights we had

drawing

After going hierarchical:

drawing

JAGS under the hood

Key technique used by JAGS: slice sampling.

Why should we look under the hood? To understand the failure modes and how to address them.

Things that the slice sampler has problems with (failure modes):

Solutions to these problems:

First failure mode: unidentifiability

Recall our unidentifiable model (the likelihood depends on p and p2 only through their product, so the data cannot distinguish them):

model Unidentifiable {
  ...
  laws {
    p ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p, p2, nLaunches ~ Binomial(nLaunches, p * p2)
  }
}

It gave us the following posterior:

drawing

MCMC: failure modes

Unidentifiability creates computational difficulties.

Example:

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
require(rjags)
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require(coda)
require(ggplot2)

modelstring="
  model {
  
    p1 ~ dunif(0, 1)
    p2 ~ dunif(0, 1)
  
    x ~ dbin(p1 * p2, n)
  }
"

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
plots <- list()
i <- 1
for (n in c(10,10000)) {local({
  my.seed <- 1 
  set.seed(my.seed) 
  inits <- list(.RNG.name = "base::Mersenne-Twister",
                .RNG.seed = my.seed)
  
  model <- jags.model(
    textConnection(modelstring), 
    data = list( # Pass on the data:
      'n' = n,
      'x' = n/2), 
    inits=inits) 
  
  samples <- 
    coda.samples(model,
                 c('p1', 'p2'), # These are the variables we want to monitor (plot, etc)
                 100) # number of MCMC iterations

  
  chain <- samples[[1]]
  p1 <- chain[,1]
  p2 <- chain[,2]
  df <- data.frame("p1" = as.numeric(p1), "p2" = as.numeric(p2))
  
  p <- ggplot(df, aes(p1, p2)) +
    xlim(0, 1) +
    ylim(0, 1) +
    geom_path()
  plots[[i]] <<- p

  })
  i <- i + 1
}
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 2
##    Total graph size: 9
## 
## Initializing model
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 2
##    Total graph size: 9
## 
## Initializing model
multiplot(plotlist = plots)
## Loading required package: grid

Second failure mode: “multimodality”

N <- 100000

components <- sample(1:3,prob=c(0.3,0.5,0.2),size=N,replace=TRUE)
mus <- c(0,10,3)
sds <- sqrt(c(1,1,0.1))

samples <- rnorm(n=N,mean=mus[components],sd=sds[components])
data <- data.frame(samples)
ggplot(data, aes(samples)) + geom_density() + theme_bw()

require(rjags)
require(ggplot2)
require(coda)
require(ggmcmc) # Note: nice package converting JAGS output into tidy format + ggplot-based trace/posterior histograms/etc
## Loading required package: ggmcmc
modelstring="
  model {
    for (k in 1:2) {
      mu[k] ~ dnorm(0, .0001)   # note: JAGS dnorm takes a precision, not a variance
      sigma[k] ~ dunif(0, 100)  # despite the name, used as a precision below
      alpha[k] <- 1
    }
    pi ~ ddirch(alpha)
    
    for (n in 1:length(observation)) {
      z[n] ~ dcat(pi)           # latent component indicator
      observation[n] ~ dnorm(mu[z[n]], sigma[z[n]])
    }
  }
"

data <- read.csv( file = "mixture_data_with_header.csv", header = TRUE)

# Make the simulations reproducible
my.seed <- 1 # 1 is arbitrary. The rest of the simulation is deterministic given that initial seed.
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  textConnection(modelstring), 
  data = data, 
  inits = inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 300
##    Unobserved stochastic nodes: 305
##    Total graph size: 1216
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('mu'), # These are the variables we want to monitor (plot, compute expectations, etc)
               10000) # number of MCMC iterations

tidy <- ggs(samples)
ggs_traceplot(tidy) + ylim(100,210) + theme_bw()

drawing

Prerequisite to understand slice sampling and MH

Question: how to design a sampler for a uniform distribution on a set \(A\), assuming only pointwise evaluation?

Recall: pointwise evaluation means all you can do is: “given \(x\), answer if \(x \in A\)”.

Example: \(A = \{0, 1, 2\}\)

Idea: use a random walk so that it scales to complicated/high dimensional \(A\)

Idea 1 (incorrect)

Move to one neighbor at random.

compute_stationary_distribution <- function(M) {
  eig <- eigen(t(M))
  # stationary distribution = eigenvector of t(M) with eigenvalue 1
  real.indices <- which(abs(Im(eig$values)) < 1e-6)
  i <- real.indices[which.max(Re(eig$values[real.indices]))]
  statio <- Re(eig$vectors[,i])
  return(statio / sum(statio)) # normalize into a probability vector
}

M = matrix(c(
  0,   1, 0,
  1/2, 0, 1/2,
  0,   1, 0
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))
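Note what the barplot shows: the stationary distribution of walk 1 is \((1/4, 1/2, 1/4)\), i.e. the middle state is visited twice as often as the endpoints, so this walk does not target the uniform distribution on \(A\).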

Idea 2: propose + accept reject

M = matrix(c(
  1/2,   1/2, 0,
  1/2,   0,   1/2,
  0,     1/2, 1/2
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))
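This time the transition matrix is doubly stochastic (its columns also sum to one), so the stationary distribution is exactly the uniform distribution on \(A\), as the barplot confirms.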

Understanding why idea/walk 1 fails while idea/walk 2 works

Occupancy vector and transition matrix

Stationarity

drawing

Note: the fixed point equation \(\pi M = \pi\) is the same as the equation defining eigenvectors and eigenvalues, up to a transpose: taking transposes gives \(M^\top \pi^\top = \pi^\top\), i.e. \(\pi^\top\) is an eigenvector of \(M^\top\) with eigenvalue 1. This explains how I implemented the function compute_stationary_distribution in the earlier R code:

print(compute_stationary_distribution)
## function(M) {
##   eig <- eigen(t(M))
##   # stationary distribution = eigenvector of t(M) with eigenvalue 1
##   real.indices <- which(abs(Im(eig$values)) < 1e-6)
##   i <- real.indices[which.max(Re(eig$values[real.indices]))]
##   statio <- Re(eig$vectors[,i])
##   return(statio / sum(statio)) # normalize into a probability vector
## }

Quick overview of (finite) Markov chain theory

Understanding JAGS

Recall: JAGS code and graphical model

slope ~ dnorm(0, 0.001)
intercept ~ dnorm(0, 0.001)

for (i in 1:length(temperature)) {
  p[i] <- ilogit(intercept + slope * temperature[i]) # recall: ilogit = logistic
  F[i] ~ dbern(p[i])
}

drawing

Decision tree

drawing

Note: make sure you understand how the bivariate posterior density plot matches up with the decision tree (what is on the x axis? on the y axis?)

Computing the posterior

There are (at least) two ways to get to the posterior:

Recall: Bayes rule
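In the notation used below, where \(x\) is a path and \(\gamma(x)\) is the unnormalized posterior (prior times likelihood), Bayes rule reads:

\[\pi(x) = \frac{\gamma(x)}{Z}, \qquad Z = \sum_{x'} \gamma(x')\]

(with the sum replaced by an integral for continuous variables).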

We are able to compute \(\gamma(x)\) for any given path, but the problem is that computing \(Z\) is hard.

What JAGS/MCMC allows: computing the posterior when you only know \(\gamma\), not \(Z\).

Quick recap: computing \(\gamma\)

Let’s start with a review of how to compute \(\gamma(x)\) for a given path. I want to convince you that JAGS can do that automatically from the model you wrote.

drawing

Example on board:
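As a complement to the board example, here is a minimal R sketch (not JAGS internals; temperature and F stand for the data vectors from the earlier logistic regression example, and the name log_gamma is illustrative) of what evaluating \(\log \gamma\) amounts to:

# log of the unnormalized posterior density gamma for the logistic
# regression model shown earlier, at a given (intercept, slope) pair
log_gamma <- function(intercept, slope, temperature, F) {
  # JAGS dnorm(0, 0.001) has precision 0.001, i.e. sd = sqrt(1/0.001)
  log_prior <- dnorm(intercept, 0, sqrt(1/0.001), log = TRUE) +
               dnorm(slope,     0, sqrt(1/0.001), log = TRUE)
  p <- plogis(intercept + slope * temperature) # ilogit = logistic
  log_lik <- sum(dbinom(F, size = 1, prob = p, log = TRUE))
  log_prior + log_lik
}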

JAGS/slice sampling for computing the posterior when you only know \(\gamma\), not \(Z\)

Output of JAGS/MCMC: a list of samples \(S = (X_1, X_2, \dots, X_n)\) from which we can answer all the questions we care about.
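For example, any posterior expectation can be approximated by averaging over this list:

\[{\mathbb{E}}[f(X) | \text{data}] \approx \frac{1}{n} \sum_{i=1}^{n} f(X_i)\]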

Intuition: JAGS performs a random walk under the density

drawing

Algorithm: how JAGS computes \(S\):

Note: because of “rejections” in the last step, \(S\) will actually contain duplicates, which is why I have started using the more precise terminology of a “list” of samples rather than a “set”.

More details

Now let us see how JAGS proposes path changes, and then how to make the accept-reject decision

drawing

For more on slice sampling (in particular, how to avoid sensitivity to the choice of window size): see the original slice sampling paper by R. Neal.
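To make the window-size discussion concrete, here is a minimal sketch of one-dimensional slice sampling with a randomly positioned initial window of width w and the shrinkage step (function and argument names are illustrative; see Neal's paper for the stepping-out refinement; gamma_fn is any unnormalized density we can evaluate pointwise):

slice_sampler <- function(gamma_fn, x0, n_iter = 1000, w = 1) {
  x <- x0
  samples <- numeric(n_iter)
  for (it in 1:n_iter) {
    u <- runif(1, 0, gamma_fn(x))  # auxiliary height under the curve
    left <- x - runif(1, 0, w)     # random window of width w around x
    right <- left + w
    repeat {
      x_star <- runif(1, left, right)
      if (gamma_fn(x_star) > u) break # inside the slice: accept
      # outside the slice: shrink the window towards x and retry
      if (x_star < x) left <- x_star else right <- x_star
    }
    x <- x_star
    samples[it] <- x
  }
  return(samples)
}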

Relation with Metropolis-Hastings (MH)

Case 1: proposing to go “uphill”, \(\gamma(x^*) \ge \gamma(x_i)\). In this case, we always accept! See the picture:

drawing

Case 2: proposing to go “downhill”, \(\gamma(x^*) < \gamma(x_i)\). In this case, we may accept or reject… See the picture:

drawing

We can compute the probability that we accept! It’s the height of the green stick in bold divided by the height of the black stick in bold:

\[\frac{\gamma(x^*)}{\gamma(x_i)}\]

This quantity is called the MH ratio.

Generalization: non-uniform proposal

\[\min\left\{1, \frac{\gamma(x^*)}{\gamma(x_i)}\frac{q(x_i|x^*)}{q(x^*|x_i)}\right\}\]

Notice that we indeed get back our simpler formula \(\gamma(x^*)/\gamma(x_i)\) when \(q\) is symmetric.
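Putting this together, here is a minimal sketch of a random walk MH sampler in R (a symmetric normal proposal, so the \(q\) terms cancel and the acceptance probability reduces to the simpler formula; gamma_fn is any unnormalized density we can evaluate pointwise):

mh_sampler <- function(gamma_fn, x0, n_iter = 1000, proposal_sd = 1) {
  x <- x0
  samples <- numeric(n_iter)
  for (it in 1:n_iter) {
    x_star <- x + rnorm(1, sd = proposal_sd)  # symmetric proposal
    if (runif(1) < gamma_fn(x_star) / gamma_fn(x)) {
      x <- x_star  # accept (always happens when going uphill)
    }              # otherwise reject: keep x, creating a duplicate in the list
    samples[it] <- x
  }
  return(samples)
}

# Example: targeting an unnormalized standard normal density
out <- mh_sampler(function(x) exp(-x^2/2), x0 = 0, n_iter = 5000)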

First failure mode: unidentifiability

drawing

Why is this landscape hard to explore for slice sampling?

Understanding the failure

drawing

Note: other situations may cause this as well, but unidentifiability (or “weak identifiability”) is the most common one

Second failure mode: “multimodality”

drawing

Understanding the failure

drawing

Solutions to these two failure modes

Discussion of last week's example: change point problem / segmentation

drawing

Change point problem

Data looks like this:

drawing

Can you build a Bayesian model for that?
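One possible sketch, in JAGS (an assumption for illustration only: it treats the data as counts y with a single change in a Poisson rate, which may not match the drawing above):

model {
  for (t in 1:length(y)) {
    prior.tau[t] <- 1 / length(y)
  }
  tau ~ dcat(prior.tau)        # uniform prior on the change point
  lambda[1] ~ dgamma(1, 0.1)   # rate before the change
  lambda[2] ~ dgamma(1, 0.1)   # rate after the change
  for (t in 1:length(y)) {
    regime[t] <- 1 + step(t - tau - 0.5)  # 1 before tau, 2 after
    y[t] ~ dpois(lambda[regime[t]])
  }
}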

How to pick distributions?

Frequently used distributions

By type:

By interpretation:

Mathematical model