# Overview

• Goal in next few topics:
• introduce prototypical components of Bayesian models
• at the same time, provide pointers to different extensions and ways to combine them
• This topic: regression
• a classical statistical problem; I assume you know the basics
• but please stop me if I slip a piece of jargon you are not familiar with
• Sub-topics today:
• Generalized Linear Models (GLM)
• Missing data
• Sparsity via Spike-and-Slab

# Moving towards more interesting models: the Challenger disaster and GLMs

Context:

• Challenger space shuttle exploded 73 seconds after launch in 1986
• Cause: O-ring used in the solid rocket booster failed due to low temperature at the time of launch (31 F [-1 C]; all temperatures in Fahrenheit from now on)
• Question investigated: was it unsafe to authorize the launch given the temperature in the morning of the launch?

Data:

data <- read.csv("challenger_data-train.csv")
knitr::kable(data, floating.environment="sidewaystable")
| Date       | Temperature (F) | Damage.Incident |
|------------|-----------------|-----------------|
| 04/12/1981 | 66              | 0               |
| 11/12/1981 | 70              | 1               |
| 3/22/82    | 69              | 0               |
| 6/27/82    | 80              | NA              |
| 01/11/1982 | 68              | 0               |
| 04/04/1983 | 67              | 0               |
| 6/18/83    | 72              | 0               |
| 8/30/83    | 73              | 0               |
| 11/28/83   | 70              | 0               |
| 02/03/1984 | 57              | 1               |
| 04/06/1984 | 63              | 1               |
| 8/30/84    | 70              | 1               |
| 10/05/1984 | 78              | 0               |
| 11/08/1984 | 67              | 0               |
| 1/24/85    | 53              | 1               |
| 04/12/1985 | 67              | 0               |
| 4/29/85    | 75              | 0               |
| 6/17/85    | 70              | 0               |
| 7/29/85    | 81              | 0               |
| 8/27/85    | 76              | 0               |
| 10/03/1985 | 79              | 0               |
| 10/30/85   | 75              | 1               |
| 11/26/85   | 76              | 0               |
| 01/12/1986 | 58              | 1               |

# Regression setup

Regression models (Bayesian and non-Bayesian) involve the following quantities:

• output variables: instances of which we try to “predict”
• also known as “target”, “label”, “predicted variable”, “regressand”, …
• sometimes observed (“training instances”), sometimes unobserved (“test instances”)
• in our example? (see the snippet after this list)
• input variables: what we use as the basis of each prediction
• also known as “independent variables”, “covariates”, “predictors”, “regressors”, “features”, …
• typically observed (both at training and test time)
• parameters: auxiliary quantities that encode a function mapping inputs to (information on) output
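To make these concrete in our running example, here is a quick check on the data frame loaded earlier (the NA row is a launch whose damage record is missing, foreshadowing the missing-data sub-topic):

# Output: Damage.Incident (binary, one entry missing);
# input: Temperature; parameters: the slope and intercept introduced below.
table(data$Damage.Incident, useNA = "ifany")
summary(data$Temperature)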

# Bayesian approach

Model: logistic regression. Input: temperature. Output: failure indicator (i.e. a binary variable).

Questions:

• What does a Bayesian version of logistic regression look like?
• How to predict from it?

# Graphical model and Blang implementation

Blang implementation
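In equations, the model is as follows (the priors match the dnorm(0, 0.001) declarations in the JAGS code of the next section; JAGS parameterizes the normal by precision, so 0.001 corresponds to variance 1000):

$$\begin{aligned}
\text{slope} &\sim \mathcal{N}(0, 1000) \\
\text{intercept} &\sim \mathcal{N}(0, 1000) \\
\text{incident}_i \mid \text{temperature}_i, \text{slope}, \text{intercept} &\sim \text{Bern}\left(\text{logistic}(\text{intercept} + \text{slope} \cdot \text{temperature}_i)\right)
\end{aligned}$$

where $$\text{logistic}(z) = 1/(1+e^{-z})$$ is the inverse of the logit link function (ilogit in JAGS).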

# Alternative: JAGS implementation

model {

  slope ~ dnorm(0, 0.001) # JAGS dnorm is parameterized by (mean, precision)
  intercept ~ dnorm(0, 0.001)

  for (i in 1:length(temperature)) {
    # equivalently: incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }

  # prediction at the launch-morning temperature, 31F:
  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
require("rjags")
require("coda")
require("ggplot2")

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  'model.bugs',
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident,
    'temperature' = data$Temperature),
  inits = inits)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 23
##    Unobserved stochastic nodes: 4
##    Total graph size: 111
##
## Initializing model
samples <-
  coda.samples(model,
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # variables to monitor (plot, etc)
               100000) # number of MCMC iterations
summary(samples)
##
## Iterations = 1001:101000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##                    Mean      SD  Naive SE Time-series SE
## intercept       19.8246 9.25184 0.0292569       0.803987
## predictedBinary  0.9930 0.08361 0.0002644       0.001392
## predictedPr      0.9925 0.04013 0.0001269       0.001341
## slope           -0.3032 0.13592 0.0004298       0.011746
##
## 2. Quantiles for each variable:
##
##                    2.5%     25%     50%     75%    97.5%
## intercept        5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary  1.0000  1.0000  1.0000  1.0000  1.00000
## predictedPr      0.9329  0.9989  0.9999  1.0000  1.00000
## slope           -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                      lower       upper
## intercept        4.0054009 38.42484799
## predictedBinary  1.0000000  1.00000000
## predictedPr      0.9723887  1.00000000
## slope           -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
 # "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
##    intercept          slope
##  Min.   :-2.853   Min.   :-0.98311
##  1st Qu.:13.195   1st Qu.:-0.38156
##  Median :18.452   Median :-0.28303
##  Mean   :19.825   Mean   :-0.30323
##  3rd Qu.:25.153   3rd Qu.:-0.20573
##  Max.   :66.041   Max.   : 0.02563
ggplot(df, aes(intercept, slope)) +
geom_density_2d()

For a more modern way of using the output of JAGS: http://mjskay.github.io/tidybayes/

Recall our Bayes estimation framework from last week:

${\textrm{argmin}}\{{\mathbf{E}}[L(a, Z)|X] : a \in {\mathcal{A}}\}$

### Poll: How to cast regression into our Bayes estimation framework?

1. Any assignment of $$Z$$ and $$X$$ is possible
2. $$Z$$ contains test outputs; while $$X$$ contains all others
3. $$Z$$ contains parameters; while $$X$$ contains all others
4. $$Z$$ contains parameters, test outputs; while $$X$$ contains training outputs, and all inputs
5. $$Z$$ contains parameters, test outputs; while $$X$$ contains training outputs

# Shortcut

• When the input variables are ALWAYS observed (both at test and training), it is a waste of time to put a prior $$p({\text{inputs}})$$ on them
• Recall: $$\pi({\text{unknowns}}) \propto \gamma({\text{inputs}}, {\text{unknowns}}, {\text{others}})$$
• As usual, “$$\dots \propto \dots$$” means “$$\dots = C \dots$$”, where $$C$$ is a constant that does not depend on the unknown variables
• We have by chain rule: $$\gamma({\text{inputs}}, {\text{unknowns}}, {\text{others}}) = p({\text{inputs}}) p({\text{unknowns}}, {\text{others}}| {\text{inputs}}) = C p({\text{unknowns}}, {\text{others}}| {\text{inputs}})$$
• Hence if you have two different priors $$p_1({\text{inputs}})$$ and $$p_2({\text{inputs}})$$, the posterior will be the same
• Practical situation where this shortcut does not work?

# Prediction

• Recall: first objective is to predict failure at 31F

### Poll: what is the Bayesian approach to this prediction problem?

1. Compute the posterior mean of the slope and intercept, $$\hat a_\text{mean}$$ and $$\hat b_\text{mean}$$, and return $$\text{logistic}(31 \hat a_\text{mean}+ \hat b_\text{mean})$$
2. Compute the MAP of the slope and intercept, $$\hat a_\text{MAP}$$ and $$\hat b_\text{MAP}$$, and return $$\text{logistic}(31 \hat a_\text{MAP}+ \hat b_\text{MAP})$$
3. Compute the posterior median of the slope and intercept, $$\hat a_\text{med}$$ and $$\hat b_\text{med}$$, and return $$\text{logistic}(31 \hat a_\text{med}+ \hat b_\text{med})$$
4. None of the above

# Prediction

• All of these options are examples of the “plug-in approach” and are non-Bayesian
• Frequentist “plug-in approach”
• Find a point estimate on parameters, e.g. slope and intercept $$\hat a$$ and $$\hat b$$
• Plug it into the arguments of the likelihood function of the datapoint to predict, e.g. $$\text{logistic}(31 \hat a+ \hat b)$$
• Bayesian:
• Do not plug in! (except in some rare situations where doing so coincides with the strategy below)
• Instead: apply the Bayes estimator!
• Introduce the variable you want to predict into the model
• Use the posterior over that variable
• Introduce a loss function for this prediction

# Zero-one loss

• Here $$z = (\text{slope}, \text{intercept}, y)$$, where $$y\in\{0,1\}$$ denotes the outcome of the next launch, i.e. the single binary variable we seek to predict.
• Let $$a \in \{0, 1\}$$ denote our action (prediction).
• Let us pick the loss $$L(a, z) = {{\bf 1}}[y \neq a]$$.

$$\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[Y \neq a] | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}(Y \neq a | X) : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ 1 - {\mathbb{P}}(Y = a | X) : a \in {\mathcal{A}}\} \\
&= {\textrm{argmax}}\{ {\mathbb{P}}(Y = a | X) : a \in {\mathcal{A}}\} \\
&= \text{MAP estimator}
\end{aligned}$$

So we obtain a MAP estimator, but on the predicted binary variable directly (contrast with the MLE plug-in approach, which first maximizes over the continuous parameters, then plugs these estimates in to predict).
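A minimal sketch of this computation in R, reusing the samples object from the JAGS run above (this works because predictedBinary was one of the monitored variables):

# Monte Carlo approximation of the Bayes estimator under the 0-1 loss:
# the posterior mode of the predicted binary variable.
chain <- as.matrix(samples)                 # stack the monitored samples
post.pr <- mean(chain[, "predictedBinary"]) # estimates P(Y = 1 | X)
bayes.estimate <- as.integer(post.pr > 0.5) # argmax over a in {0, 1}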

# GLMs: general framework

• Look at the data type of the output variable; this will guide the choice of likelihood model
• Are you trying to predict..
• a real number? Replace the Bernoulli by a normal (or better: a fat-tailed distribution such as the t distribution, for robustness to outliers)
• an integer? Replace Bernoulli by Poisson (or Negative Binomial, or other integer valued distribution)
• etc
• General pattern:

${\text{output}}| {\text{inputs}}, {\text{parameter}}\sim \text{SuitableDistribution}(f({\text{parameter}}, {\text{inputs}}))$

• $$f$$ should map to the range of parameters permitted by your “SuitableDistribution”
• known in the GLM literature as the inverse of the link function
• often a composition of a linear function with a non-linear “squashing” function if the output is constrained
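As an illustration of this pattern, here is a sketch of a Poisson GLM in JAGS (not part of the Challenger example; the data vectors x and count are hypothetical), where the inverse link $$f = \exp$$ maps the linear function into the Poisson mean's range $$(0, \infty)$$:

# Sketch only: Poisson regression for integer-valued outputs.
cat("
model {
  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)
  for (i in 1:length(x)) {
    lambda[i] <- exp(intercept + slope * x[i]) # inverse link maps reals to (0, infinity)
    count[i] ~ dpois(lambda[i])
  }
}
", file = "poisson-glm.bugs")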

# Sparsity and model selection

• Sometimes, the end goal is not to predict but rather to select a scientific model amongst candidate models
• Example:
• Model 1: the temperature has no effect on the probability of failure
• Model 2: the model we developed today
• Notice: if we set the slope parameter to zero, model 2 recovers model 1 as a special case
• Loss function for model selection?
• Bayes estimator?

# Bayes estimator

The Bayes estimator, under a 0-1 loss on the model choice (encoding model 1 as the event $$\text{slope} = 0$$), is

${\textrm{argmax}}\{{\mathbb{P}}({{\bf 1}}[\text{slope} = 0] = a|X) : a \in \{0, 1\}\}$

### Poll: would you use the Bayes estimator to select between the two models?

1. Yes, this is the optimal approach
2. No, instead I would check whether the credible interval on the slope contains 0
3. No, because this would always select the more complex model

# Difficulty

• Under the prior, $${\mathbb{P}}(\text{slope}=0) = 0$$
• Note:

${\mathbb{P}}(A) = 0 \Longrightarrow {\mathbb{P}}(A | B) = \frac{{\mathbb{P}}(A \cap B)}{{\mathbb{P}}(B)} \le \frac{{\mathbb{P}}(A)}{{\mathbb{P}}(B)} = 0$

• Hence the simpler model will never be selected!
• The answer based on credible intervals in the last poll is also incorrect: another example of plug-in!
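An empirical illustration using the df data frame of posterior samples built earlier: since the prior, and hence the posterior, puts zero mass on the point slope = 0, no MCMC sample ever hits it exactly.

# Proportion of posterior samples with slope exactly zero: returns 0.
mean(df$slope == 0)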

# Solution: change the prior to put positive mass on the event (slope = 0)

• Mixture of a continuous prior (for the non-zero slope model) and a point mass at zero. Can be defined as the prior satisfying the following equation for all $$a < b$$: ${\mathbb{P}}(\text{slope} \in [a, b]) = p {{\bf 1}}[0 \in [a, b]] + (1-p) \int_a^b \phi(x) {\text{d}}x,$ where:
• $$p\in [0, 1]$$ is an additional parameter which controls prior belief in the simpler model, $${\mathbb{P}}(\text{slope}=0) = p$$
• $$\phi$$ is the density of the continuous part of the prior (e.g., normal)
• After this change, the Bayes estimator will behave well
• Lesson: if there are posterior probabilities in the Bayes estimator, choose your prior so as to ensure that the corresponding prior probabilities are positive
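As a sanity check on this definition, here is a minimal R sketch of the prior probability formula above (the spike weight p = 0.5 and the normal slab with variance 1000 are illustrative choices, mirroring the dbern(0.5) and dnorm(0, 0.001) declarations in the JAGS code further below):

# P(slope in [a, b]) under the spike-and-slab prior:
# point mass p at zero, plus (1 - p) times a normal slab.
spike.slab.prob <- function(a, b, p = 0.5, sd = sqrt(1000)) {
  p * (a <= 0 & 0 <= b) + (1 - p) * (pnorm(b, 0, sd) - pnorm(a, 0, sd))
}
spike.slab.prob(-0.1, 0.1) # includes the point mass: slightly above 0.5
spike.slab.prob(0.1, 10)   # slab contribution only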

Blang implementation

• Blang, developed in my group, efficiently supports discrete, combinatorial and continuous random variables; especially useful for hard non-convex problems. Sections 1 to 6 will be most relevant at this point, but you can have a peek at 7, 10, 11.
• Probabilistic Programming and Bayesian Methods for Hackers, Cameron Davidson-Pilon
• Frameworks focussing on autodiff+HMC, e.g. for continuous random variables and nice log-concave targets: Stan, Edward, pyMC3, pyro (some support discrete variables with various degrees of efficiency).
• Languages focussing on maximum expressivity: Anglican, Birch

# JAGS version of Spike-and-Slab

model.name <- "spike-slab.bugs"
{ # Extra bracket needed only for R markdown files
sink(model.name)
cat("
model {

  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)
  selected ~ dbern(0.5) # model indicator: 0 = no temperature effect (model 1), 1 = full model (model 2)

  for (i in 1:length(temperature)) {
    # equivalently: incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - selected * slope * temperature[i])))
    p[i] <- ilogit(intercept + selected * slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }

  # prediction at the launch-morning temperature, 31F:
  predictedPr <- ilogit(intercept + selected * slope * 31)
  predictedBinary ~ dbern(predictedPr)
}

",fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files

require("rjags")
require("coda")
require("ggplot2")

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  model.name,
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident,
    'temperature' = data$Temperature),
  inits = inits)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 23
##    Unobserved stochastic nodes: 5
##    Total graph size: 113
##
## Initializing model
samples <-
  coda.samples(model,
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # variables to monitor (plot, etc)
               100000) # number of MCMC iterations
summary(samples)
##
## Iterations = 1001:101000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##                    Mean      SD  Naive SE Time-series SE
## intercept        3.0611  8.8648 0.0280329        1.58688
## predictedBinary  0.4461  0.4971 0.0015719        0.02981
## predictedPr      0.4450  0.2887 0.0009129        0.07922
## slope           -0.0205 28.1810 0.0891162        0.11244
##
## 2. Quantiles for each variable:
##
##                     2.5%      25%     50%     75% 97.5%
## intercept        -1.7724  -1.0794 -0.6995 -0.1377 28.65
## predictedBinary   0.0000   0.0000  0.0000  1.0000  1.00
## predictedPr       0.1452   0.2536  0.3319  0.4651  1.00
## slope           -59.1470 -15.1741 -0.2526 15.2250 58.56
plot(samples)
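Note that selected was not monitored above. A minimal follow-up (reusing the model object) estimates the posterior probability of model 2 directly:

# MC estimate of P(selected = 1 | data), i.e. the posterior probability
# that temperature affects the failure probability.
selected.samples <- coda.samples(model, c('selected'), 100000)
mean(as.matrix(selected.samples)[, "selected"])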