STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

3/6/2019

Goals

Today:

“Homework” Software setup! Be ready to use next week:

Next week: Bayesian model building

Logistics

Recall: First way to break consistency: violating Cromwell’s rule

drawing

Recall: Second way to break consistency: Unidentifiable models

model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}

drawing

Recall: Theoretical justification of consistency: Doob’s consistency theorem

Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.

Corollary: if you use the wrong prior, \(\tilde \pi_0\), but \(\tilde \pi_0(A) > 0 \Longleftrightarrow \pi_0(A) > 0\), then Doob’s consistency still holds for identifiable exchangeable models.

Note: the same holds for any point estimate derived from the posterior distribution (if you get the posterior right, anything derived from it will also be right)

Recall: calibration of credible intervals

require(ggplot2)

set.seed(1)
K <- 1000
prior_used_for_computing_posterior <- rep(1/(K+1), K+1) # Use the same (uniform) prior for simulation and posterior computation
n_repeats <- 1000
alpha <- 0.1

hdi_coverage_pr <- function(n_datapoints) {
  # posterior_distribution() and high_density_intervals() are the helper functions defined in earlier lectures
  n_inclusions <- 0
  for (repetition in seq_len(n_repeats)) {
    i <- sample(0:K, 1)  # Always generate the data using a uniform prior on {0, 1, ..., K}
    true_p <- i/K
    x <- rbinom(1, n_datapoints, true_p)
    post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
    
    # Hacky way to check whether the true parameter is in the HDI credible interval
    if (sum(abs(true_p - high_density_intervals(alpha, post)[1,]) < 10e-10) == 1) {
      n_inclusions <- n_inclusions + 1
    }
  }
  return(n_inclusions/n_repeats) # Fraction of simulations where the true parameter was in the interval
}

df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)

ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

Calibration of credible intervals (misspecified case)

set.seed(1)
K <- 1000
n_repeats <- 1000
alpha <- 0.1

# Now use the same non-uniform prior as before for the posterior calculation
# (the data are still generated from the uniform prior)
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K), mean = 0.2, sd = 0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)

df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)

ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

Asymptotic calibration under certain misspecification

drawing

Bernstein-von Mises: under certain conditions, even when the prior is misspecified, the actual coverage of credible intervals converges to the nominal coverage (“converges” just means “gets arbitrarily close”).

Conditions include, but are not limited to:

  1. \(\theta \in \Theta\) needs to live in a continuous rather than discrete space! I.e.: \(\Theta \subset {\mathbf{R}}^d\)
    • Intuition: Bernstein-von Mises actually proves something stronger: convergence of the rescaled, centered posterior to a normal distribution, which is a continuous distribution (can you guess the scaling?)
    • “Bayesian Central Limit Theorem”
  2. \(p(\theta)\) (the prior) is now a density instead of a PMF,
    • we assume it is non-zero in a neighborhood of the true parameter
  3. we assume the posterior mean is consistent
    • recall: that means the error goes to zero as the number of data points increases
  4. we assume the model is parametric:
    • \(\theta\) is allowed to be a vector, but you are not allowed to make it bigger as the number of observations increases

We will explore these conditions in more detail.
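For reference, here is one common informal statement of the Bernstein-von Mises result (stated loosely, omitting the regularity conditions; it also answers the scaling question above):

\[\sqrt{n}\left(\theta - \hat\theta_n\right) \;\Big|\; X_{1:n} \;\approx\; \text{Normal}\left(0, I(\theta_0)^{-1}\right) \text{ for large } n,\]

where \(\hat\theta_n\) is a consistent estimator such as the MLE, \(\theta_0\) is the true parameter, and \(I(\theta_0)\) is the Fisher information. This is why it is sometimes called the “Bayesian Central Limit Theorem”.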

Survey: who is familiar with asymptotic normality of the MLE?

Practical consequence:

De Finetti: intuition and motivation

drawing

drawing

Exchangeable random variables
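Recall the definition: \((X_1, X_2, \dots)\) are exchangeable if, for every \(n\) and every permutation \(\sigma\) of \(\{1, \dots, n\}\), the joint distribution is invariant under permutation:

\[(X_1, \dots, X_n) \overset{d}{=} (X_{\sigma(1)}, \dots, X_{\sigma(n)})\]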

De Finetti: formal statement

Theorem: if \((X_1, X_2, \dots)\) are exchangeable Bernoulli random variables, then there exist a parameter \(\theta\), a prior on \(\theta\), and a likelihood such that the observations are iid given \(\theta\).
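In equation form (writing the prior as a density \(\pi\) for simplicity), the representation says that for all \(n\) and \(x_1, \dots, x_n \in \{0, 1\}\):

\[{\mathbb{P}}(X_1 = x_1, \dots, X_n = x_n) = \int_0^1 \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} \, \pi(\theta) {\text{d}}\theta\]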

Continuous random variables: motivation

Continuous random variables: first implementation

Recall: how to use a density to compute a probability?

drawing

\[{\mathbb{P}}(X \in [0.1, 0.3]) = \int_{0.1}^{0.3} f(x) {\text{d}}x\]

drawing

Monte Carlo as a black box

Motivation:

However, for simplicity, let us start by assuming the model has only one real-valued unknown parameter.

Idea:

\[{\mathbb{P}}(X \in [0.1, 0.3]) \approx \frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M}\]

Intuition. Let’s say you want to compute the area of this weird curved shape \(A\):

drawing

How to create Monte Carlo samples \(S\)

Guiding principle: we have two ways to compute probabilities: densities and Monte Carlo. We want that, as the size of \(S\) increases, the discrepancy between the two methods becomes arbitrarily close to zero.
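A minimal R sketch of this idea (the Beta(2, 5) distribution here is an arbitrary stand-in for a distribution we can both sample from and integrate exactly; it is not part of the running example):

set.seed(1)
M <- 100000

# Monte Carlo samples S: M iid draws from a Beta(2, 5) distribution
S <- rbeta(M, shape1 = 2, shape2 = 5)

# Monte Carlo estimate: fraction of the points in S that fall in [0.1, 0.3]
monte_carlo <- sum(S >= 0.1 & S <= 0.3) / M

# Same probability computed from the density (via the CDF), for comparison
from_density <- pbeta(0.3, 2, 5) - pbeta(0.1, 2, 5)

c(monte_carlo = monte_carlo, from_density = from_density)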

Computing a mean

drawing

\[{\mathbf{E}}[X] = \int x f(x) {\text{d}}x\]

\[{\mathbf{E}}[X] \approx \frac{\sum_{i=1}^M X_i}{M}\]

\[{\mathbf{E}}[X^2] \approx \frac{\sum_{i=1}^M X^2_i}{M}\]

\[{\mathbf{E}}[g(X)] \approx \frac{\sum_{i=1}^M g(X_i)}{M}\]
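These estimators are one line each in R; continuing the arbitrary Beta(2, 5) example above (for which \({\mathbf{E}}[X] = 2/7\) exactly, so the first estimate can be checked):

set.seed(1)
M <- 100000
S <- rbeta(M, shape1 = 2, shape2 = 5)

mean(S)      # estimates E[X]; exact value is 2/7
mean(S^2)    # estimates E[X^2]
mean(log(S)) # estimates E[g(X)] for g = log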

More asymptotic regimes

We have done two thought experiments so far:

drawing

Note:

drawing

Lecture ended here

Moving towards more interesting models: challenger disaster and GLMs

drawing

Context:

Data:

drawing

data <- read.csv("challenger_data-train.csv")
knitr::kable(data, floating.environment="sidewaystable")
Date          Temperature   Damage.Incident
04/12/1981    66            0
11/12/1981    70            1
3/22/82       69            0
6/27/82       80            NA
01/11/1982    68            0
04/04/1983    67            0
6/18/83       72            0
8/30/83       73            0
11/28/83      70            0
02/03/1984    57            1
04/06/1984    63            1
8/30/84       70            1
10/05/1984    78            0
11/08/1984    67            0
1/24/85       53            1
04/12/1985    67            0
4/29/85       75            0
6/17/85       70            0
7/29/85       81            0
8/27/85       76            0
10/03/1985    79            0
10/30/85      75            1
11/26/85      76            0
01/12/1986    58            1

Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb

Bayesian approach and JAGS implementation

Model: logistic regression. Input: temperature. Output: failure indicator (i.e. binary) variable.
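In equation form, the model implemented in the JAGS program below: the slope and intercept get independent normal priors with mean 0 and precision 0.001 (i.e. variance 1000; JAGS parameterizes dnorm by precision), and for each launch \(i\),

\[p_i = \frac{1}{1 + \exp(-(\text{intercept} + \text{slope} \cdot \text{temperature}_i))}, \qquad \text{incident}_i | p_i \sim \text{Bernoulli}(p_i)\]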

drawing

Approaches:

model {

  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)

  for (i in 1:length(temperature)) {
    # incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }

  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
require(rjags)
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require(coda)
require(ggplot2)

data <- read.csv( file = "challenger_data-train.csv", header = TRUE)

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  'model.bugs', 
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident, 
    'temperature' = data$Temperature), 
  inits=inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 23
##    Unobserved stochastic nodes: 4
##    Total graph size: 111
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
               100000) # number of MCMC iterations
summary(samples)
## 
## Iterations = 1001:101000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 1e+05 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                    Mean      SD  Naive SE Time-series SE
## intercept       19.8246 9.25184 0.0292569       0.803987
## predictedBinary  0.9930 0.08361 0.0002644       0.001392
## predictedPr      0.9925 0.04013 0.0001269       0.001341
## slope           -0.3032 0.13592 0.0004298       0.011746
## 
## 2. Quantiles for each variable:
## 
##                    2.5%     25%     50%     75%    97.5%
## intercept        5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary  1.0000  1.0000  1.0000  1.0000  1.00000
## predictedPr      0.9329  0.9989  0.9999  1.0000  1.00000
## slope           -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                      lower       upper
## intercept        4.0054009 38.42484799
## predictedBinary  1.0000000  1.00000000
## predictedPr      0.9723887  1.00000000
## slope           -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
##    intercept          slope         
##  Min.   :-2.853   Min.   :-0.98311  
##  1st Qu.:13.195   1st Qu.:-0.38156  
##  Median :18.452   Median :-0.28303  
##  Mean   :19.825   Mean   :-0.30323  
##  3rd Qu.:25.153   3rd Qu.:-0.20573  
##  Max.   :66.041   Max.   : 0.02563
ggplot(df, aes(intercept, slope)) +
  geom_density_2d()

Some things to pay particular attention to:

drawing

Second new example

In its early days, Google used a method in the spirit of what is described below to tweak its website. Let us say they want to decide whether to use a white background or a black background (or some other cosmetic feature).

After a short interval of time, they record the following results:

Think about the following questions:

  1. What would be a model suitable for a Bayesian analysis of this problem?
  2. How to make a decision (A or B) based on the MCMC output?
  3. Optional: implement the idea using JAGS.
require(rjags)
require(coda)
require(ggplot2)

modelstring="
  model {
  
    pA ~ dunif(0, 1)
    pB ~ dunif(0, 1)
  
    sA ~ dbin(pA, nA)
    sB ~ dbin(pB, nB)
    
    AisBetterThanB <- ifelse(pA > pB, 1, 0)
  }
"

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  textConnection(modelstring), 
  data = list( # Pass on the data:
    'nA' = 51,
    'sA' = 45, 
    'nB' = 47,
    'sB' = 37), 
  inits=inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2
##    Unobserved stochastic nodes: 2
##    Total graph size: 16
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('pA', 'pB', 'AisBetterThanB'), # These are the variables we want to monitor (plot, etc)
               10000) # number of MCMC iterations
summary(samples)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                  Mean      SD  Naive SE Time-series SE
## AisBetterThanB 0.8904 0.31241 0.0031241      0.0032970
## pA             0.8673 0.04641 0.0004641      0.0004566
## pB             0.7762 0.05888 0.0005888      0.0005689
## 
## 2. Quantiles for each variable:
## 
##                  2.5%    25%    50%    75%  97.5%
## AisBetterThanB 0.0000 1.0000 1.0000 1.0000 1.0000
## pA             0.7620 0.8394 0.8720 0.9007 0.9439
## pB             0.6529 0.7388 0.7799 0.8179 0.8803
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                    lower     upper
## AisBetterThanB 0.0000000 1.0000000
## pA             0.7743609 0.9493196
## pB             0.6578581 0.8831351
## attr(,"Probability")
## [1] 0.95
chain <- samples[[1]]
pA <- chain[,2]
pB <- chain[,3]
df <- data.frame("pA" = as.numeric(pA), "pB" = as.numeric(pB))
summary(df)
##        pA               pB        
##  Min.   :0.5934   Min.   :0.5235  
##  1st Qu.:0.8394   1st Qu.:0.7388  
##  Median :0.8720   Median :0.7799  
##  Mean   :0.8673   Mean   :0.7762  
##  3rd Qu.:0.9007   3rd Qu.:0.8179  
##  Max.   :0.9788   Max.   :0.9465
ggplot(df, aes(pA, pB)) +
  xlim(0, 1) +
  ylim(0, 1) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point(size = 0.3, alpha = 0.3)

ggplot(df, aes(pA, pB)) +
  xlim(0, 1) +
  ylim(0, 1) +
  geom_abline(intercept = 0, slope = 1) +
  geom_density_2d()

\[\frac{\text{# points in lower triangle}}{M} \approx {\mathbb{P}}(p_A > p_B | \text{data}) = \int \int_{x > y} f_{p_A, p_B|\text{data}}(x, y) {\text{d}}x {\text{d}}y\]
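Equivalently, from the data frame of posterior samples built above, the left-hand side is just the line below (this matches the posterior mean of AisBetterThanB reported in the JAGS summary):

# Monte Carlo estimate of P(pA > pB | data): fraction of samples in the lower triangle
mean(df$pA > df$pB)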

Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb