STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

3/11/2019

Goals

Today:

Logistics

Updated schedule with some readings

Exercise 2:

Recap: Exchangeable random variables

Recap: De Finetti

Theorem: if \((X_1, X_2, \dots)\) is an infinite sequence of exchangeable Bernoulli random variables, then there exist a random variable \(\theta\), a prior on \(\theta\), and a likelihood such that the observations are iid given \(\theta\).
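Equivalently, the conclusion can be written as a mixture of iid Bernoulli likelihoods over a prior (mixing) distribution \(F\) on \(\theta \in [0, 1]\): for every \(n\) and every \(x_1, \dots, x_n \in \{0, 1\}\),

\[{\mathbb{P}}(X_1 = x_1, \dots, X_n = x_n) = \int_0^1 \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} \, {\text{d}}F(\theta).\]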

JAGS as a black box


JAGS and related probabilistic programming languages (PPLs) are very different from other languages you are used to:

Syntax almost the same as mathematical notation used in the lecture:

Z ~ dcat(hyperparameters)
p <- Z / 10

where we will set hyperparameters to the vector of length 9, \((1/9, 1/9, \dots, 1/9)\), i.e. a uniform distribution over \(\{1, 2, \dots, 9\}\)
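For example, here is a minimal sketch (variable names are illustrative) of how this two-line model can be run from R with rjags, passing the hyperparameter vector as data and sampling from the prior:

require("rjags")

m <- jags.model(
  textConnection("
    model {
      Z ~ dcat(hyperparameters)  # Z takes values in {1, ..., 9}
      p <- Z / 10
    }"),
  data = list('hyperparameters' = rep(1/9, 9)))
samples <- coda.samples(m, c('Z', 'p'), 1000) # monitor Z and p for 1000 iterations
summary(samples)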

More resources on JAGS + Quick ref

Quick reference

Some quick references for convenience (see the tutorials for more context and information). The notation in JAGS is fairly similar to standard mathematical notation, but with some slight differences.

Discrete distributions (PMFs)
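A few common examples (illustrative, not exhaustive; see the JAGS user manual for the full list): dbern(p) (Bernoulli), dbin(p, n) (binomial; note the success probability comes first), dcat(pi) (categorical with probability vector pi), dpois(lambda) (Poisson).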

Continuous distributions (PDFs)
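A few common examples: dunif(a, b), dbeta(a, b), dexp(rate), dgamma(shape, rate), and dnorm(mu, tau), where tau is a precision (1/variance), not a standard deviation.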

Useful functions
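For example: ilogit(x) (inverse logit, used below), logit(x), exp(x), log(x), step(x) (indicator that x >= 0), ifelse(condition, a, b), mean(v), sum(v).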

Additional readings

To get detailed information:

Moving towards more interesting models: the Challenger disaster and GLMs


Context:

Data:


data <- read.csv("challenger_data-train.csv")
knitr::kable(data, floating.environment="sidewaystable")
Date Temperature Damage.Incident
04/12/1981 66 0
11/12/1981 70 1
3/22/82 69 0
6/27/82 80 NA
01/11/1982 68 0
04/04/1983 67 0
6/18/83 72 0
8/30/83 73 0
11/28/83 70 0
02/03/1984 57 1
04/06/1984 63 1
8/30/84 70 1
10/05/1984 78 0
11/08/1984 67 0
1/24/85 53 1
04/12/1985 67 0
4/29/85 75 0
6/17/85 70 0
7/29/85 81 0
8/27/85 76 0
10/03/1985 79 0
10/30/85 75 1
11/26/85 76 0
01/12/1986 58 1

Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb

Bayesian approach and JAGS implementation

Model: logistic regression. Input: temperature. Output: failure indicator (a binary variable).
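Written out, the model used in the JAGS code below is (the \(N(0, 10^3)\) priors correspond to dnorm(0, 0.001), since JAGS parameterizes the normal by its precision):

\[\text{slope} \sim N(0, 10^3), \qquad \text{intercept} \sim N(0, 10^3),\]

\[p_i = \text{logit}^{-1}(\text{intercept} + \text{slope} \cdot \text{temperature}_i), \qquad \text{incident}_i \mid p_i \sim \text{Bern}(p_i).\]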

Questions:

Graphical model and JAGS implementation


Approaches:

model {

  # Vague priors on the regression coefficients. Note that in JAGS the second
  # argument of dnorm is a precision, so this corresponds to a variance of 1000.
  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)

  for (i in 1:length(temperature)) {
    # incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }

  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
require("rjags")
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require("coda")
require("ggplot2")

data <- read.csv( file = "challenger_data-train.csv", header = TRUE)

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

# The file 'model.bugs' contains the model block shown above
model <- jags.model(
  'model.bugs', 
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident, 
    'temperature' = data$Temperature), 
  inits=inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 23
##    Unobserved stochastic nodes: 4
##    Total graph size: 111
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
               100000) # number of MCMC iterations
summary(samples)
## 
## Iterations = 1001:101000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 1e+05 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                    Mean      SD  Naive SE Time-series SE
## intercept       19.8246 9.25184 0.0292569       0.803987
## predictedBinary  0.9930 0.08361 0.0002644       0.001392
## predictedPr      0.9925 0.04013 0.0001269       0.001341
## slope           -0.3032 0.13592 0.0004298       0.011746
## 
## 2. Quantiles for each variable:
## 
##                    2.5%     25%     50%     75%    97.5%
## intercept        5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary  1.0000  1.0000  1.0000  1.0000  1.00000
## predictedPr      0.9329  0.9989  0.9999  1.0000  1.00000
## slope           -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                      lower       upper
## intercept        4.0054009 38.42484799
## predictedBinary  1.0000000  1.00000000
## predictedPr      0.9723887  1.00000000
## slope           -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
##    intercept          slope         
##  Min.   :-2.853   Min.   :-0.98311  
##  1st Qu.:13.195   1st Qu.:-0.38156  
##  Median :18.452   Median :-0.28303  
##  Mean   :19.825   Mean   :-0.30323  
##  3rd Qu.:25.153   3rd Qu.:-0.20573  
##  Max.   :66.041   Max.   : 0.02563
ggplot(df, aes(intercept, slope)) +
  geom_density_2d()
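As a sanity check, the posterior probability of an incident at 31°F can be recomputed directly from the MCMC samples (a sketch, reusing the data frame df built above; the variable name is illustrative). The result should be close to the posterior mean of predictedPr reported in the summary, about 0.99:

# Monte Carlo estimate of P(incident at 31 degrees F | data), recomputed from the samples
failurePrAt31 <- 1 / (1 + exp(-(df$intercept + df$slope * 31)))
mean(failurePrAt31)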

Some things to pay particular attention to:

Second new example

In its early days, Google used a method in the spirit of what is described below to tweak their website. Let us say that they want to decide whether they should use a white background or a black background (or vary some other cosmetic feature).

After a short interval of time, they record the following results: version A had 45 successes out of 51 trials, and version B had 37 successes out of 47 (these are the counts nA, sA, nB, sB passed to JAGS below).

Think about the following questions:

  1. What would be a model suitable for a Bayesian analysis of this problem?
  2. How would you make a decision (A or B) based on the MCMC output?
  3. How would you implement the idea using JAGS?

JAGS implementation

require("rjags")
require("coda")
require("ggplot2")

modelstring="
  model {
  
    pA ~ dunif(0, 1)
    pB ~ dunif(0, 1)
  
    sA ~ dbin(pA, nA)  # observed number of successes out of nA trials for version A
    sB ~ dbin(pB, nB)  # likewise for version B
    
    AisBetterThanB <- ifelse(pA > pB, 1, 0)  # indicator of the event {pA > pB}
  }
"

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  textConnection(modelstring), 
  data = list( # Pass on the data:
    'nA' = 51,
    'sA' = 45, 
    'nB' = 47,
    'sB' = 37), 
  inits=inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2
##    Unobserved stochastic nodes: 2
##    Total graph size: 16
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('pA', 'pB', 'AisBetterThanB'), # These are the variables we want to monitor (plot, etc)
               10000) # number of MCMC iterations
summary(samples)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                  Mean      SD  Naive SE Time-series SE
## AisBetterThanB 0.8904 0.31241 0.0031241      0.0032970
## pA             0.8673 0.04641 0.0004641      0.0004566
## pB             0.7762 0.05888 0.0005888      0.0005689
## 
## 2. Quantiles for each variable:
## 
##                  2.5%    25%    50%    75%  97.5%
## AisBetterThanB 0.0000 1.0000 1.0000 1.0000 1.0000
## pA             0.7620 0.8394 0.8720 0.9007 0.9439
## pB             0.6529 0.7388 0.7799 0.8179 0.8803
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                    lower     upper
## AisBetterThanB 0.0000000 1.0000000
## pA             0.7743609 0.9493196
## pB             0.6578581 0.8831351
## attr(,"Probability")
## [1] 0.95
chain <- samples[[1]]
pA <- chain[,2]
pB <- chain[,3]
df <- data.frame("pA" = as.numeric(pA), "pB" = as.numeric(pB))
summary(df)
##        pA               pB        
##  Min.   :0.5934   Min.   :0.5235  
##  1st Qu.:0.8394   1st Qu.:0.7388  
##  Median :0.8720   Median :0.7799  
##  Mean   :0.8673   Mean   :0.7762  
##  3rd Qu.:0.9007   3rd Qu.:0.8179  
##  Max.   :0.9788   Max.   :0.9465
ggplot(df, aes(pA, pB)) +
  xlim(0, 1) +
  ylim(0, 1) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point(size = 0.3, alpha = 0.3)

ggplot(df, aes(pA, pB)) +
  xlim(0, 1) +
  ylim(0, 1) +
  geom_abline(intercept = 0, slope = 1) +
  geom_density_2d()

\[\frac{\text{# points in lower triangle}}{M} \approx {\mathbb{P}}(p_A > p_B | \text{data}) = \iint_{x > y} f_{p_A, p_B|\text{data}}(x, y) \, {\text{d}}x \, {\text{d}}y,\]

where \(M\) is the total number of MCMC samples (points in the scatter plot above).
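For example (a sketch, using the data frame df built above), this Monte Carlo estimate is just the fraction of samples falling in the lower triangle, and should be close to the posterior mean of AisBetterThanB reported in the summary, about 0.89:

# Monte Carlo estimate of P(pA > pB | data)
mean(df$pA > df$pB)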

Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb

End of lecture