STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

3/13/2019

Goals

Today:

Logistics

Updated schedule with some readings

Recall: Exercise 2 due on Monday

Recap: graphical model and JAGS implementation

drawing

Approaches:

model {

  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)

  for (i in 1:length(temperature)) {
    # incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }

  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
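In equation form, the model above is the following (keeping in mind that JAGS parameterizes dnorm by precision, so dnorm(0, 0.001) means variance \(1000\)); the prediction node simply plugs the temperature of interest, 31, into the same formula:

\[\begin{aligned}
\text{slope}, \text{intercept} &\sim {\mathcal{N}}(0, \sigma^2 = 1000), \\
p_i &= \text{ilogit}(\text{intercept} + \text{slope} \cdot \text{temperature}_i) = \frac{1}{1 + \exp(-\text{intercept} - \text{slope} \cdot \text{temperature}_i)}, \\
\text{incident}_i &\sim {\text{Bern}}(p_i).
\end{aligned}\]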
require("rjags")
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require("coda")
require("ggplot2")

data <- read.csv( file = "challenger_data-train.csv", header = TRUE)

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  'model.bugs', 
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident, 
    'temperature' = data$Temperature), 
  inits=inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 23
##    Unobserved stochastic nodes: 4
##    Total graph size: 111
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
               100000) # number of MCMC iterations
summary(samples)
## 
## Iterations = 1001:101000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 1e+05 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                    Mean      SD  Naive SE Time-series SE
## intercept       19.8246 9.25184 0.0292569       0.803987
## predictedBinary  0.9930 0.08361 0.0002644       0.001392
## predictedPr      0.9925 0.04013 0.0001269       0.001341
## slope           -0.3032 0.13592 0.0004298       0.011746
## 
## 2. Quantiles for each variable:
## 
##                    2.5%     25%     50%     75%    97.5%
## intercept        5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary  1.0000  1.0000  1.0000  1.0000  1.00000
## predictedPr      0.9329  0.9989  0.9999  1.0000  1.00000
## slope           -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                      lower       upper
## intercept        4.0054009 38.42484799
## predictedBinary  1.0000000  1.00000000
## predictedPr      0.9723887  1.00000000
## slope           -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
##    intercept          slope         
##  Min.   :-2.853   Min.   :-0.98311  
##  1st Qu.:13.195   1st Qu.:-0.38156  
##  Median :18.452   Median :-0.28303  
##  Mean   :19.825   Mean   :-0.30323  
##  3rd Qu.:25.153   3rd Qu.:-0.20573  
##  Max.   :66.041   Max.   : 0.02563
ggplot(df, aes(intercept, slope)) +
  geom_density_2d()

Some things to pay particular attention to:

Recap: A/B testing

In its early days, Google used a method in the spirit of what is described below to tweak their website. Let us say that they want to decide if they should use a white background or a black background (or some other cosmetic feature).

After a short interval of time, they record the following results: version A was shown \(n_A = 51\) times and led to \(s_A = 45\) successes, while version B was shown \(n_B = 47\) times and led to \(s_B = 37\) successes (these are the counts passed to JAGS below).

Think about the following questions:

  1. What would be a model suitable for a Bayesian analysis of this problem?
  2. How would you make a decision (A or B) based on the MCMC output?
  3. How would you implement the idea using JAGS?

Recap: A/B testing JAGS implementation

require("rjags")
require("coda")
require("ggplot2")

modelstring="
  model {
  
    pA ~ dunif(0, 1)
    pB ~ dunif(0, 1)
  
    sA ~ dbin(pA, nA)
    sB ~ dbin(pB, nB)
    
    AisBetterThanB <- ifelse(pA > pB, 1, 0)
  }
"

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  textConnection(modelstring), 
  data = list( # Pass on the data:
    'nA' = 51,
    'sA' = 45, 
    'nB' = 47,
    'sB' = 37), 
  inits=inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2
##    Unobserved stochastic nodes: 2
##    Total graph size: 16
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('pA', 'pB', 'AisBetterThanB'), # These are the variables we want to monitor (plot, etc)
               10000) # number of MCMC iterations
summary(samples)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                  Mean      SD  Naive SE Time-series SE
## AisBetterThanB 0.8904 0.31241 0.0031241      0.0032970
## pA             0.8673 0.04641 0.0004641      0.0004566
## pB             0.7762 0.05888 0.0005888      0.0005689
## 
## 2. Quantiles for each variable:
## 
##                  2.5%    25%    50%    75%  97.5%
## AisBetterThanB 0.0000 1.0000 1.0000 1.0000 1.0000
## pA             0.7620 0.8394 0.8720 0.9007 0.9439
## pB             0.6529 0.7388 0.7799 0.8179 0.8803
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                    lower     upper
## AisBetterThanB 0.0000000 1.0000000
## pA             0.7743609 0.9493196
## pB             0.6578581 0.8831351
## attr(,"Probability")
## [1] 0.95
chain <- samples[[1]]
pA <- chain[,2]
pB <- chain[,3]
df <- data.frame("pA" = as.numeric(pA), "pB" = as.numeric(pB))
summary(df)
##        pA               pB        
##  Min.   :0.5934   Min.   :0.5235  
##  1st Qu.:0.8394   1st Qu.:0.7388  
##  Median :0.8720   Median :0.7799  
##  Mean   :0.8673   Mean   :0.7762  
##  3rd Qu.:0.9007   3rd Qu.:0.8179  
##  Max.   :0.9788   Max.   :0.9465
ggplot(df, aes(pA, pB)) +
  xlim(0, 1) +
  ylim(0, 1) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point(size = 0.3, alpha = 0.3)

ggplot(df, aes(pA, pB)) +
  xlim(0, 1) +
  ylim(0, 1) +
  geom_abline(intercept = 0, slope = 1) +
  geom_density_2d()

\[\frac{\text{# points in lower triangle}}{M} \approx {\mathbb{P}}(p_A > p_B | \text{data}) = \int \int_{x > y} f_{p_A, p_B|\text{data}}(x, y) {\text{d}}x {\text{d}}y\]
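In code, the Monte Carlo estimate on the left-hand side is just the fraction of posterior samples in which pA exceeds pB, computed from the data frame built above:

# Monte Carlo estimate of P(pA > pB | data)
mean(df$pA > df$pB)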

Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb

Hierarchical model

Motivation:

single_rocket_spec <- 
"model {
  # The likelihood
  X ~ dbin(p,n)
  
  # Our prior
  p ~ dbeta(a,b)
}"

Key idea: use “side data” to inform the prior

data <- read.csv("failure_counts.csv")
require("dplyr") # provides the %>% pipe used below
data %>% 
  head() %>% 
  knitr::kable(floating.environment="sidewaystable")
LV.Type       numberOfLaunches   numberOfFailures
Aerobee                      1                  0
Angara A5                    1                  0
Antares 110                  2                  0
Antares 120                  2                  0
Antares 130                  1                  1
Antares 230                  1                  0

How to (badly) use side data

First try

LV.Type       numberOfLaunches   numberOfFailures
Aerobee                      1                  0
Angara A5                    1                  0
Antares 110                  2                  0
Antares 120                  2                  0
Antares 130                  1                  1
Antares 230                  1                  0

Why it is bad

Towards an improved way to use side data

An improved way to use side data (not quite full Bayesian yet!)

counts <- read.csv("failure_counts.csv")
ggplot(counts, aes(x = numberOfFailures / numberOfLaunches)) +
  geom_histogram()
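One way to make "use the histogram as a prior" concrete is a plug-in, empirical-Bayes-style shortcut: fit a Beta(a, b) to the observed failure rates and pass the fitted values as the a and b of single_rocket_spec. The moment-matching recipe and the names a.hat, b.hat below are my own sketch, not necessarily the recipe used in the lecture:

rates <- counts$numberOfFailures / counts$numberOfLaunches
m <- mean(rates)
v <- var(rates)
# Beta(a, b): mean = a / (a + b), variance = mean * (1 - mean) / (a + b + 1)
ab.sum <- m * (1 - m) / v - 1
a.hat <- m * ab.sum
b.hat <- (1 - m) * ab.sum
c(a.hat, b.hat)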

Solution: go fully Bayesian

Recall:

  1. Construct a probability model including
    • random variables for what we will measure/observe
    • random variables for the unknown quantities
      • those we are interested in (“parameters”, “predictions”)
      • others that just help us formulate the problem (“nuisance”, “random effects”).
  2. Compute the posterior distribution conditionally on the actual data at hand
  3. Use the posterior distribution to:
    • make predictions (point estimates)
    • estimate uncertainty (credible intervals)
    • make a decision

drawing

In our case: just make \(\mu\) and \(s\) random! (or equivalently, \(\alpha\) and \(\beta\))

New higher-level hyperparameters = new problems? No, we are probably OK!

It may seem that we have just introduced new problems, since we again have hyperparameters: those of the priors on \(\mu\) and \(s\). Here we picked \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\) and \(s \sim \text{Exponential}(1/10000)\).
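Concretely, the hierarchical version of single_rocket_spec could look like the sketch below. The name hierarchical_rocket_spec and the reparameterization \(\alpha = \mu s\), \(\beta = (1 - \mu) s\) are my additions; the column names match failure_counts.csv:

hierarchical_rocket_spec <- 
"model {
  # Hyperpriors, as above (JAGS's dexp is parameterized by its rate)
  mu ~ dbeta(1, 1)      # equivalently Unif(0, 1)
  s ~ dexp(1/10000)

  # Equivalent (alpha, beta) parameterization of the shared Beta prior
  alpha <- mu * s
  beta <- (1 - mu) * s

  # One failure probability per launch vehicle type, tied together by the shared prior
  for (i in 1:length(numberOfLaunches)) {
    p[i] ~ dbeta(alpha, beta)
    numberOfFailures[i] ~ dbin(p[i], numberOfLaunches[i])
  }
}"

The data list passed to jags.model would then contain the numberOfLaunches and numberOfFailures columns of failure_counts.csv.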

Key point: yes, but now we are less sensitive to these choices!

Why? Heuristic: say you have a random variable connected to some hyperparameters (grey squares) and random variables connected to data (circles).

Before going hierarchical: for maiden/early flights we had

drawing

After going hierarchical:

drawing

Using more information

full <- read.csv("processed.csv")
[1] "    X  X..Launch    Launch.Date..UTC.      COSPAR         PL.Name                        Orig.PL.Name                SATCAT   LV.Type                 LV.S.N            Site                              Suc   Ref                     Suc_bin  Family           Space.Port    Year   Launch.Index"
[2] "-----  -----------  ---------------------  -------------  -----------------------------  --------------------------  -------  ----------------------  ----------------  --------------------------------  ----  ---------------------  --------  ---------------  -----------  -----  -------------"
[3] "    1  1957 ALP     1957 Oct  4 1928:34    1957 ALP 2     1-y ISZ                        PS-1                        S00002   Sputnik 8K71PS          M1-PS             NIIP-5   LC1                      S     Energiya                      1  Sputnik          NIIP          1957              1"
[4] "    2  1957-U01     1957 Oct 17 0505       1957-U01       USAF 88 Charge A               Poulter Pellet              A08258   Aerobee                 USAF 88           HADC     A                        S     EngSci1.58                    1  Aerobee          HADC          1957              1"
[5] "    3  1957 BET     1957 Nov  3 0230:42    1957 BET 1     2-y ISZ                        PS-2                        S00003   Sputnik 8K71PS          M1-2PS            NIIP-5   LC1                      S     Grahn-WWW                     1  Sputnik          NIIP          1957              2"
[6] "    4  1957-F01     1957 Dec  6 1644:35    1957-F01       Vanguard                       Vanguard Test Satellite     F00002   Vanguard                TV-3              CC       LC18A                    F     Vang-ER9948                   0  Vanguard         CC            1957              1"

Taller hierarchies

Next example: change point problem / segmentation

drawing

Change point problem

Data looks like this:

drawing

Can you build a Bayesian model for that?
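As a possible starting point, here is a minimal JAGS-style sketch of a single change point model, assuming the data are counts \(y_1, \dots, y_T\) whose Poisson rate switches once at an unknown time; all names here (y, tau, lambda, rate) are illustrative, not taken from the slides:

model {
  # Rates before and after the change point (vague Gamma priors)
  lambda[1] ~ dgamma(0.001, 0.001)
  lambda[2] ~ dgamma(0.001, 0.001)

  # Uniform prior on the change point location
  for (t in 1:length(y)) {
    unifPrior[t] <- 1 / length(y)
  }
  tau ~ dcat(unifPrior)

  for (t in 1:length(y)) {
    # step(tau - t) is 1 up to and including the change point, 0 afterwards
    rate[t] <- lambda[1] * step(tau - t) + lambda[2] * (1 - step(tau - t))
    y[t] ~ dpois(rate[t])
  }
}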

How to pick distributions?

Frequently used distributions

By type:

By interpretation:

Mathematical model

JAGS under the hood

Key technique used by JAGS: slice sampling.
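To make the idea concrete, here is a minimal univariate slice sampler in R, in the style of the stepping-out and shrinkage procedures of Neal (2003). This is a sketch for intuition, not JAGS's actual implementation, and the names (slice_sample_once, log_f, w) are illustrative:

slice_sample_once <- function(x0, log_f, w = 1) {
  # 1. Auxiliary variable: a height drawn uniformly under the density at x0
  #    (on the log scale: log u = log f(x0) + log Unif(0, 1))
  log_u <- log_f(x0) + log(runif(1))
  # 2. "Stepping out": grow an interval [L, R] around x0 until both
  #    endpoints fall below the slice (assumes the slice is bounded)
  L <- x0 - runif(1) * w
  R <- L + w
  while (log_f(L) > log_u) L <- L - w
  while (log_f(R) > log_u) R <- R + w
  # 3. "Shrinkage": sample uniformly in [L, R]; on rejection, shrink
  #    the interval towards x0 and try again
  repeat {
    x1 <- runif(1, L, R)
    if (log_f(x1) > log_u) return(x1)
    if (x1 < x0) L <- x1 else R <- x1
  }
}

# Example: 1000 slice sampling moves targeting a standard normal
x <- 0
xs <- numeric(1000)
for (j in 1:1000) { x <- slice_sample_once(x, function(z) dnorm(z, log = TRUE)); xs[j] <- x }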

Why should we look under the hood? To understand the failure modes and how to address them.

Things that the slice sampler has problems with (failure modes):

Solutions to these problems:

First failure mode: unidentifiability

We have done two thought experiments so far:

drawing

model Unidentifiable {
  ...
  laws {
    p ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p, p2, nLaunches ~ Binomial(nLaunches, p * p2)
  }
}
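The source of the trouble: the likelihood depends on \(p\) and \(p_2\) only through their product,

\[L(p, p_2) = \binom{\text{nLaunches}}{\text{nFails}} (p \, p_2)^{\text{nFails}} (1 - p \, p_2)^{\text{nLaunches} - \text{nFails}},\]

so it is constant along the curves \(p \, p_2 = \text{constant}\); for example, the data cannot distinguish \((p, p_2) = (0.5, 1)\) from \((1, 0.5)\).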

drawing

drawing

MCMC: failure modes

Unidentifiability creates computational difficulties.

Example:

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
require(rjags)
require(coda)
require(ggplot2)

modelstring="
  model {
  
    p1 ~ dunif(0, 1)
    p2 ~ dunif(0, 1)
  
    x ~ dbin(p1 * p2, n)
  }
"

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
plots <- list()
i <- 1
for (n in c(10,10000)) {local({
  my.seed <- 1 
  set.seed(my.seed) 
  inits <- list(.RNG.name = "base::Mersenne-Twister",
                .RNG.seed = my.seed)
  
  model <- jags.model(
    textConnection(modelstring), 
    data = list( # Pass on the data:
      'n' = n,
      'x' = n/2), 
    inits=inits) 
  
  samples <- 
    coda.samples(model,
                 c('p1', 'p2'), # These are the variables we want to monitor (plot, etc)
                 100) # number of MCMC iterations

  
  chain <- samples[[1]]
  p1 <- chain[,1]
  p2 <- chain[,2]
  df <- data.frame("p1" = as.numeric(p1), "p2" = as.numeric(p2))
  
  p <- ggplot(df, aes(p1, p2)) +
    xlim(0, 1) +
    ylim(0, 1) +
    geom_path()
  plots[[i]] <<- p

  })
  i <- i + 1
}
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 2
##    Total graph size: 9
## 
## Initializing model
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 2
##    Total graph size: 9
## 
## Initializing model
multiplot(plotlist = plots)
## Loading required package: grid

Second failure mode: “multimodality”

N <- 100000

components <- sample(1:3,prob=c(0.3,0.5,0.2),size=N,replace=TRUE)
mus <- c(0,10,3)
sds <- sqrt(c(1,1,0.1))

samples <- rnorm(n=N,mean=mus[components],sd=sds[components])
data <- data.frame(samples)
ggplot(data, aes(samples)) + geom_density() + theme_bw()

require(rjags)
require(ggplot2)
require(coda)
require(ggmcmc) # Note: nice package converting JAGS output into tidy format + ggplot-based trace/posterior histograms/etc
## Loading required package: ggmcmc
modelstring="
  model {
    for (k in 1:2) {
      mu[k] ~ dnorm(0, .0001)
      sigma[k] ~ dunif(0, 100)
      alpha[k] <- 1
    }
    pi ~ ddirch(alpha)
    
    for (n in 1:length(observation)) {
      z[n] ~ dcat(pi)
      # NB: JAGS parameterizes dnorm by precision (not standard deviation),
      # so sigma[k] here plays the role of a precision
      observation[n] ~ dnorm(mu[z[n]], sigma[z[n]])
    }
  }
"

data <- read.csv( file = "mixture_data_with_header.csv", header = TRUE)

# Make the simulations reproducible
my.seed <- 1 # 1 is arbitrary. The rest of the simulation is deterministic given that initial seed.
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  textConnection(modelstring), 
  data = data, 
  inits = inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 300
##    Unobserved stochastic nodes: 305
##    Total graph size: 1216
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('mu'), # These are the variables we want to monitor (plot, compute expectations, etc)
               10000) # number of MCMC iterations

tidy <- ggs(samples)
ggs_traceplot(tidy) + ylim(100,210) + theme_bw()

drawing

Prerequisite to understand slice sampling and MH

Question: how to design a sampler for a uniform distribution on a set \(A\), assuming only pointwise evaluation?

Recall: pointwise evaluation means that all you can do is, given \(x\), answer whether \(x \in A\).

Example: \(A = \{0, 1, 2\}\)

Idea: use a random walk, so that the approach scales to complicated/high-dimensional \(A\).

Idea 1 (incorrect)

Move to one neighbor at random.

compute_stationary_distribution <- function(M) {
  # eigenvector of t(M) for eigenvalue 1, normalized to sum to one
  eig <- eigen(t(M))
  eigenvalues <- eig$values
  i <- which.max(Re(eigenvalues[abs(Im(eigenvalues)) < 1e-6]))
  statio <- Re(eig$vectors[,i])
  return(statio / sum(statio))
}

M = matrix(c(
  0,   1, 0,
  1/2, 0, 1/2,
  0,   1, 0
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))

Idea 2: propose + accept reject

M = matrix(c(
  1/2,   1/2, 0,
  1/2,   0,   1/2,
  0,     1/2, 1/2
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))

Understanding why idea/walk 1 fails while idea/walk 2 works
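A quick check (a calculation filled in here, not copied from the slides): for walk 1, solving the stationarity equation \(\pi M = \pi\) gives \(\pi = (1/4, 1/2, 1/4)\), so the middle state is visited twice as often as the end states and the walk does not target the uniform distribution on \(A\). For walk 2, the transition matrix is doubly stochastic (its columns also sum to one), and for a doubly stochastic matrix the uniform vector satisfies \(\pi M = \pi\), so the uniform distribution on \(A\) is stationary.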

Occupancy vector and transition matrix

Stationarity

drawing
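In symbols (this is the fixed point equation referred to in the note below): writing \(\pi\) for the occupancy (row) vector and \(M\) for the transition matrix, stationarity means

\[\pi M = \pi \quad\Longleftrightarrow\quad M^\top \pi^\top = 1 \cdot \pi^\top,\]

i.e. \(\pi^\top\) is an eigenvector of \(M^\top\) with eigenvalue \(1\).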

Note: the fixed point equation is the same as the equation defining eigenvectors and eigenvalues (up to transpose). This explains how I implemented the function compute_stationary_distribution in the earlier R code:

print(compute_stationary_distribution)
## function(M) {
##   # eigenvector of t(M) for eigenvalue 1, normalized to sum to one
##   eig <- eigen(t(M))
##   eigenvalues <- eig$values
##   i <- which.max(Re(eigenvalues[abs(Im(eigenvalues)) < 1e-6]))
##   statio <- Re(eig$vectors[,i])
##   return(statio / sum(statio))
## }