Moving towards more interesting models: challenger disaster and GLMs

drawing

Context:

Challenger space shuttle exploded 73 seconds after launch in 1986
Cause: O-ring used in the solid rocket booster failed due to low temperature at the time of launch (31 F [-1 C]; all temperatures in Fahrenheit from now on)
Question investigated: was it unsafe to authorize the launch given the temperature in the morning of the launch?

Data:

drawing

data <- read.csv("challenger_data-train.csv")
knitr::kable(data, floating.environment="sidewaystable")

Date	Temperature	Damage.Incident
04/12/1981	66	0
11/12/1981	70	1
3/22/82	69	0
6/27/82	80	NA
01/11/1982	68	0
04/04/1983	67	0
6/18/83	72	0
8/30/83	73	0
11/28/83	70	0
02/03/1984	57	1
04/06/1984	63	1
8/30/84	70	1
10/05/1984	78	0
11/08/1984	67	0
1/24/85	53	1
04/12/1985	67	0
4/29/85	75	0
6/17/85	70	0
7/29/85	81	0
8/27/85	76	0
10/03/1985	79	0
10/30/85	75	1
11/26/85	76	0
01/12/1986	58	1

Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb

Regression setup

Regression models (Bayesian and non-Bayesian) involve the following quantities:

output variables: instances of which we try to “predict”
- also known as “target”, “label”, “predicted variable”, “regressand”, …
- sometimes observed (“training instances”), sometimes unobserved (“test instances”)
- in our example?
input variables: what we use as the basis of each prediction
- also known as “independent variables”, “covariates”, “predictor”, “regressors”, “feature”,..
- typically always observed (both at training and test time)
parameters: auxiliary quantities that encode a function mapping inputs to (information on) output

Bayesian approach

Model: logistic regression. Input: temperature. Output: failure indicator (i.e. binary) variable.

Questions:

What does a Bayesian version of logistic regression look like?
How to predict from it?

Graphical model and Blang implementation

Blang implementation

drawing

Alternative: JAGS implementation

model {

  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)

  for (i in 1:length(temperature)) {
    # incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }

  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}

require("rjags")
require("coda")
require("ggplot2")

data <- read.csv( file = "challenger_data-train.csv", header = TRUE)

 # Make the simulations reproducible
 # 1 is arbitrary, but the rest of the simulation is 
 # deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  'model.bugs', 
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident, 
    'temperature' = data$Temperature), 
  inits=inits)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 23
##    Unobserved stochastic nodes: 4
##    Total graph size: 111
## 
## Initializing model

samples <- 
  coda.samples(model,
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
               100000) # number of MCMC iterations 100000
summary(samples)

## 
## Iterations = 1001:101000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 1e+05 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                    Mean      SD  Naive SE Time-series SE
## intercept       19.8246 9.25184 0.0292569       0.803987
## predictedBinary  0.9930 0.08361 0.0002644       0.001392
## predictedPr      0.9925 0.04013 0.0001269       0.001341
## slope           -0.3032 0.13592 0.0004298       0.011746
## 
## 2. Quantiles for each variable:
## 
##                    2.5%     25%     50%     75%    97.5%
## intercept        5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary  1.0000  1.0000  1.0000  1.0000  1.00000
## predictedPr      0.9329  0.9989  0.9999  1.0000  1.00000
## slope           -0.6264 -0.3816 -0.2830 -0.2057 -0.09434

plot(samples)

print(HPDinterval(samples))

## [[1]]
##                      lower       upper
## intercept        4.0054009 38.42484799
## predictedBinary  1.0000000  1.00000000
## predictedPr      0.9723887  1.00000000
## slope           -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95

 # "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)

##    intercept          slope         
##  Min.   :-2.853   Min.   :-0.98311  
##  1st Qu.:13.195   1st Qu.:-0.38156  
##  Median :18.452   Median :-0.28303  
##  Mean   :19.825   Mean   :-0.30323  
##  3rd Qu.:25.153   3rd Qu.:-0.20573  
##  Max.   :66.041   Max.   : 0.02563

ggplot(df, aes(intercept, slope)) +
  geom_density_2d()

For a more modern way of using the output of JAGS: http://mjskay.github.io/tidybayes/

Linking to Bayes estimation

Recall our Bayes estimation framework from last week:

\[{\textrm{argmin}}\{{\mathbf{E}}[L(a, Z)|X], a \in {\mathcal{A}}\}\]

Poll: How to cast regression into our Bayes estimation framework?

Any assignments of \(Z\) and \(X\) is possible
\(Z\) contains test outputs; while \(X\) contains all others
\(Z\) contains parameters; while \(X\) contains all others
\(Z\) contains parameters, test outputs; while \(X\) contains training outputs, and all inputs
\(Z\) contains parameters, test outputs; while \(X\) contains training outputs

Shortcut

When the input variables are ALWAYS observed (both at test and training), it is a waste of time to put a prior \(p({\text{inputs}})\) on them
Recall: \(\pi({\text{unknowns}}) \propto \gamma({\text{inputs}}, {\text{unknowns}}, {\text{others}})\)
- As usual, “\(\dots \propto \dots\)” means “\(\dots = C \dots\)”, where \(C\) is a constant that does not depend on the unkown variables
- We have by chain rule: \(\gamma({\text{inputs}}, {\text{unknowns}}, {\text{others}}) = p({\text{inputs}}) p({\text{unknowns}}, {\text{others}}| {\text{inputs}}) = C p({\text{unknowns}}, {\text{others}}| {\text{inputs}})\)
- Hence if you have two different priors \(p_1({\text{inputs}})\) and \(p_2({\text{inputs}})\), the posterior will be the same
Practical situation where does this shortcut does not work?

Prediction

Recall: first objective is to predict failure at 31F

Poll: what is the Bayesian approach to this prediction problem

Compute the posterior mean of the slope and intercept, \(\hat a_\text{mean}\) and \(\hat b_\text{mean}\), and return \(\text{logistic}(31 \hat a_\text{mean}+ \hat b_\text{mean})\)
Compute the MAP of the slope and intercept, \(\hat a_\text{MAP}\) and \(\hat b_\text{MAP}\), and return \(\text{logistic}(31 \hat a_\text{MAP}+ \hat b_\text{MAP})\)
Compute the posterior median of the slope and intercept, \(\hat a_\text{med}\) and \(\hat b_\text{med}\), and return \(\text{logistic}(31 \hat a_\text{med}+ \hat b_\text{med})\)
None of the above

Prediction

All of these options are examples of the “plug-in approach” and are non-Bayesian
Frequentist “plug-in approach”
- Find a point estimate on parameters, e.g. slope and intercept \(\hat a\) and \(\hat b\)
- Plug-in into the arguments of the likelihood function of the datapoint to predict, e.g. \(\text{logistic}(31 \hat a+ \hat b)\)
Bayesian:
- Do not plug-in! (except in some rare situation where it coincides with the strategy below)
- Instead: apply the Bayes estimator!
  - Introduce the variable you want to predict into the model
  - Use the posterior over that variable
  - Introduce a loss function for this prediction

Zero-one loss

Here \(z = (\slope, \intercept, y)\), where \(y\in\{0,1\}\) denotes the next launch, i.e. is a single binary variable we seek to predict.
Let also denote \(a \in \{0, 1\}\) as our action (prediction).
Let us pick the loss \(L(a, z) = {{\bf 1}}[y \neq a]\).

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[Y \neq a] | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}(Y \neq a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ 1 - {\mathbb{P}}(Y = a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmax}}\{ {\mathbb{P}}(Y = a | X) : a \in {\mathcal{A}}\} \\ &= \text{MAP estimator} \end{aligned} \]

So we obtain a MAP estimator, but on the predicted binary variable directly (contrast with the MLE plug-in approach, which maximizes continuous parameters first, then plug-in these estimates to predict)

GLMs: general framework

Look at the data type of the output variable, this will guide the choice of likelihood model
Are you trying to predict..
- a real number? Replace the Bernoulli by a normal (or better: fat tail distribution such as t distribution, for robustness to outliers)
- an integer? Replace Bernoulli by Poisson (or Negative Binomial, or other integer valued distribution)
- etc
General pattern:

\[{\text{output}}| {\text{inputs}}, {\text{parameter}}\sim \text{SuitableDistribution}(f({\text{parameter}}, {\text{inputs}}))\]

\(f\) should map to the range of parameters permitted by your “SuitableDistribution”
known in the GLM literature as the inverse of the link function
often a composition of a linear function with a non-linear “squashing” function if the output is constrained

Sparsity and model selection

Sometimes, the end goal is not to predict but rather to select a scientific model amongst candidate models
Example:
- Model 1: the temperature has no effect on the probability of failure
- Model 2: the model we developed today
Notice: if we set the slope parameter to zero, model 2 recovers model 1 as a special case
Loss function for model selection?
Bayes estimator?

Bayes estimator

The Bayes estimator is

\[{\textrm{argmax}}\{{\mathbb{P}}(\text{slope}=a|X) : a \in \{0, 1\}\}\]

Poll: would you use the Bayes estimator to select between the two models?

Yes, this is the optimal approach
No, instead I would look if the credible intervals on the slope contains 0
No, because this would always select the more complex model

Difficulty

Under the prior, \({\mathbb{P}}(\text{slope}=0) = 0\)
Note:

\[{\mathbb{P}}(A) = 0 \Longrightarrow {\mathbb{P}}(A | B) = \frac{{\mathbb{P}}(A \cap B)}{{\mathbb{P}}(B)} \le \frac{{\mathbb{P}}(A)}{{\mathbb{P}}(B)} = 0 \]

Hence the simpler model will never be selected!
The answer based on credible intervals in the last poll is also incorrect: another example of plug-in!

Solution: change the prior to put positive mass on the event (slope = 0)

Mixture of a continuous prior (for the non-zero slope model) and a point mass at zero. Can be defined as the prior satisfying the following equation for all \(a < b\): \[{\mathbb{P}}(\text{slope} \in [a, b]) = p {{\bf 1}}[0 \in [a, b]] + (1-p) \int_a^b \phi(x) {\text{d}}x,\] where:
- \(p\in [0, 1]\) in an additional parameter which controls prior belief in the simpler model, \({\mathbb{P}}(\text{slope}=0) = p\)
- \(\phi\) is the density of the continuous part of the prior (e.g., normal)
After this change, the Bayes estimator will behave well
Lesson: if there are posterior probabilities in the Bayes estimator, choose your prior so as to ensure that the correspond prior probabilities are positive

Blang implementation

Readings

Start reading about one or two PPLs, e.g.:

Blang, developed in my group, which efficiently supports both discrete, combinatorial and continuous random variables; especially useful for hard non-convex problems. Sections 1 to 6 will be most relevant at this point, but you can have a peak at 7, 10, 11.
Probabilistic programming for hacker, Cameron Davidson-Pilon
- Probabilistic programming book giving a programmer-centric view of Bayesian analysis
- Many lively examples
- Free ebook: http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
Frameworks focussing on autodiff+HMC, e.g. for continuous random variable and nice log-concave targets: Stan, Edward, pyMC3, pyro (some support discrete variable at various degrees of efficiency).
Languages focussing on maximum expressivity: Anglican, Birch

JAGS version of Spike-and-Slab

model.name <- "spike-slab.bugs"
{ # Extra bracket needed only for R markdown files
sink(model.name)
cat(" 
model {

  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)
  selected ~ dbern(0.5)

  for (i in 1:length(temperature)) {
    # incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + selected * slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }

  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + selected * slope * 31)
  predictedBinary ~ dbern(predictedPr)
}

",fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files

require("rjags")
require("coda")
require("ggplot2")

data <- read.csv( file = "challenger_data-train.csv", header = TRUE)

 # Make the simulations reproducible
 # 1 is arbitrary, but the rest of the simulation is 
 # deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  model.name, 
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident, 
    'temperature' = data$Temperature), 
  inits=inits)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 23
##    Unobserved stochastic nodes: 5
##    Total graph size: 113
## 
## Initializing model

samples <- 
  coda.samples(model,
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
               100000) # number of MCMC iterations 100000
summary(samples)

## 
## Iterations = 1001:101000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 1e+05 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                    Mean      SD  Naive SE Time-series SE
## intercept        3.0611  8.8648 0.0280329        1.58688
## predictedBinary  0.4461  0.4971 0.0015719        0.02981
## predictedPr      0.4450  0.2887 0.0009129        0.07922
## slope           -0.0205 28.1810 0.0891162        0.11244
## 
## 2. Quantiles for each variable:
## 
##                     2.5%      25%     50%     75% 97.5%
## intercept        -1.7724  -1.0794 -0.6995 -0.1377 28.65
## predictedBinary   0.0000   0.0000  0.0000  1.0000  1.00
## predictedPr       0.1452   0.2536  0.3319  0.4651  1.00
## slope           -59.1470 -15.1741 -0.2526 15.2250 58.56

plot(samples)

print(HPDinterval(samples))

## [[1]]
##                       lower    upper
## intercept        -2.3624604 24.19539
## predictedBinary   0.0000000  1.00000
## predictedPr       0.1685827  1.00000
## slope           -60.0940038 57.54338
## attr(,"Probability")
## [1] 0.95

 # "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)

##    intercept           slope          
##  Min.   :-3.4135   Min.   :-127.3777  
##  1st Qu.:-1.0794   1st Qu.: -15.1741  
##  Median :-0.6995   Median :  -0.2526  
##  Mean   : 3.0611   Mean   :  -0.0205  
##  3rd Qu.:-0.1377   3rd Qu.:  15.2250  
##  Max.   :57.1284   Max.   : 131.9987

ggplot(df, aes(intercept, slope)) +
  geom_density_2d()

Exercise

First, review the concepts in this topic by going over this exercise set: Regression.html

Second, as an optional extra problem, consider the following alternative approach to the Challenger data: a change-point model.

Assume there is critical temperature, \(T\), such that the failure probability is allowed to take a different value for temperatures smaller versus higher than \(T\). Otherwise, the failure probability is assumed to be constant within each region.
Implement this idea in a new Blang model. Report the posterior probability distribution.
Notice that Blang output in results/latest/logNormalizationEstimate.csv an estimate of the quantity \(\log p(\text{data})\), i.e. the logarithm of the marginal density of the data (use the one obtained from the “stepping stone” estimator, it is also shown in the standard output shown during the execution of the program). As we will see, this quantity is useful to compare different models (higher marginal likelihood means the model is “less surprised by the data”). Use this criterion to decide which of the models we considered seems more adequate.

Regression

Overview

Moving towards more interesting models: challenger disaster and GLMs

Regression setup

Bayesian approach

Graphical model and Blang implementation

Alternative: JAGS implementation

Linking to Bayes estimation

Poll: How to cast regression into our Bayes estimation framework?

Shortcut

Prediction

Poll: what is the Bayesian approach to this prediction problem

Prediction

Zero-one loss

GLMs: general framework

Sparsity and model selection

Bayes estimator

Poll: would you use the Bayes estimator to select between the two models?

Difficulty

Solution: change the prior to put positive mass on the event (slope = 0)

Readings

JAGS version of Spike-and-Slab

Exercise