Alexandre Bouchard-Côté

- Goal in next few topics:
- introduce prototypical components of Bayesian models
- at the same time, provide pointers to different extensions and ways to combine them

- This topic: regression
- a classical statistical problem; I assume you know the basics
- but please stop me if I slip a piece of jargon you are not familiar with

- Sub-topics today:
- Generalized Linear Models (GLM)
- Missing data
- Sparsity via Spike-and-Slab

**Context:**

- The Challenger space shuttle exploded 73 seconds after launch in 1986
- Cause: O-ring used in the solid rocket booster failed due to low temperature at the time of launch (31 F [-1 C]; all temperatures in Fahrenheit from now on)
- Question investigated: was it unsafe to authorize the launch given the temperature on the morning of the launch?

Data:

```
data <- read.csv("challenger_data-train.csv")
knitr::kable(data, floating.environment="sidewaystable")
```

Date | Temperature | Damage.Incident |
---|---|---|
04/12/1981 | 66 | 0 |
11/12/1981 | 70 | 1 |
3/22/82 | 69 | 0 |
6/27/82 | 80 | NA |
01/11/1982 | 68 | 0 |
04/04/1983 | 67 | 0 |
6/18/83 | 72 | 0 |
8/30/83 | 73 | 0 |
11/28/83 | 70 | 0 |
02/03/1984 | 57 | 1 |
04/06/1984 | 63 | 1 |
8/30/84 | 70 | 1 |
10/05/1984 | 78 | 0 |
11/08/1984 | 67 | 0 |
1/24/85 | 53 | 1 |
04/12/1985 | 67 | 0 |
4/29/85 | 75 | 0 |
6/17/85 | 70 | 0 |
7/29/85 | 81 | 0 |
8/27/85 | 76 | 0 |
10/03/1985 | 79 | 0 |
10/30/85 | 75 | 1 |
11/26/85 | 76 | 0 |
01/12/1986 | 58 | 1 |

Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb

Regression models (Bayesian and non-Bayesian) involve the following quantities:

- output variables: instances of which we try to “predict”
- also known as “target”, “label”, “predicted variable”, “regressand”, …
- sometimes observed (“**training** instances”), sometimes unobserved (“**test** instances”) - in our example?

- input variables: what we use as the basis of each prediction
- also known as “independent variables”, “covariates”, “predictors”, “regressors”, “features”, …
- typically observed (both at training and test time)

- parameters: auxiliary quantities that encode a function mapping inputs to (information on) output

**Model:** logistic regression. Input: temperature. Output: failure indicator (i.e. binary) variable.

**Questions:**

- What does a Bayesian version of logistic regression look like?
- How to predict from it?

```
model {
  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)
  for (i in 1:length(temperature)) {
    # Equivalent explicit form: 1.0 / (1.0 + exp(- intercept - slope * temperature[i]))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }
  # Posterior predictive probability of failure at 31F:
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
```
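The `ilogit` call in the model is JAGS’s inverse-logit (standard logistic) function. As a quick sanity check, here is a Python sketch (the parameter values are made up for illustration) confirming it matches the explicit form shown in the comment:

```python
import math

def ilogit(x):
    # Inverse logit (logistic) function: maps the real line to (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical intercept and slope, for illustration only:
intercept, slope = 15.0, -0.23

for t in [31, 65, 75]:
    explicit = 1.0 / (1.0 + math.exp(-intercept - slope * t))
    # Both forms agree up to floating-point error:
    assert abs(ilogit(intercept + slope * t) - explicit) < 1e-12
```

With a negative slope, lower temperatures map to higher failure probabilities, which is the qualitative pattern the posterior recovers below.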

```
require("rjags")
require("coda")
require("ggplot2")

data <- read.csv(file = "challenger_data-train.csv", header = TRUE)

# Make the simulations reproducible.
# The seed 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  'model.bugs',
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident,
    'temperature' = data$Temperature),
  inits = inits)
```

```
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 23
## Unobserved stochastic nodes: 4
## Total graph size: 111
##
## Initializing model
```

```
samples <-
  coda.samples(model,
               # Variables to monitor (plot, summarize, etc.):
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'),
               100000) # number of MCMC iterations
summary(samples)
```

```
##
## Iterations = 1001:101000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## intercept 19.8246 9.25184 0.0292569 0.803987
## predictedBinary 0.9930 0.08361 0.0002644 0.001392
## predictedPr 0.9925 0.04013 0.0001269 0.001341
## slope -0.3032 0.13592 0.0004298 0.011746
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## intercept 5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary 1.0000 1.0000 1.0000 1.0000 1.00000
## predictedPr 0.9329 0.9989 0.9999 1.0000 1.00000
## slope -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
```

`plot(samples)`

`print(HPDinterval(samples))`

```
## [[1]]
## lower upper
## intercept 4.0054009 38.42484799
## predictedBinary 1.0000000 1.00000000
## predictedPr 0.9723887 1.00000000
## slope -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
```

```
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]  # coda orders columns alphabetically by variable name
slope <- chain[,4]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
```

```
## intercept slope
## Min. :-2.853 Min. :-0.98311
## 1st Qu.:13.195 1st Qu.:-0.38156
## Median :18.452 Median :-0.28303
## Mean :19.825 Mean :-0.30323
## 3rd Qu.:25.153 3rd Qu.:-0.20573
## Max. :66.041 Max. : 0.02563
```

```
ggplot(df, aes(intercept, slope)) +
  geom_density_2d()
```

For a more modern way of using the output of JAGS: http://mjskay.github.io/tidybayes/

Recall our Bayes estimation framework from last week:

\[{\textrm{argmin}}\{{\mathbf{E}}[L(a, Z)|X], a \in {\mathcal{A}}\}\]

- Several assignments of \(Z\) and \(X\) are possible:
- \(Z\) contains test outputs; while \(X\) contains all others
- \(Z\) contains parameters; while \(X\) contains all others
- \(Z\) contains parameters, test outputs; while \(X\) contains training outputs, and all inputs
- \(Z\) contains parameters, test outputs; while \(X\) contains training outputs

- When the input variables are ALWAYS observed (both at test and training), it is a waste of time to put a prior \(p({\text{inputs}})\) on them
- Recall: \(\pi({\text{unknowns}}) \propto \gamma({\text{inputs}}, {\text{unknowns}}, {\text{others}})\)
- As usual, “\(\dots \propto \dots\)” means “\(\dots = C \dots\)”, where \(C\) is a constant that does not depend on the unknown variables
- We have by chain rule: \(\gamma({\text{inputs}}, {\text{unknowns}}, {\text{others}}) = p({\text{inputs}}) p({\text{unknowns}}, {\text{others}}| {\text{inputs}}) = C p({\text{unknowns}}, {\text{others}}| {\text{inputs}})\)
- Hence if you have two different priors \(p_1({\text{inputs}})\) and \(p_2({\text{inputs}})\), the posterior will be the same
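The cancellation of \(p({\text{inputs}})\) can be checked numerically on a toy model. This is my own illustration, not part of the lecture: a tiny discrete model in Python where two very different input priors yield the same posterior over the parameter.

```python
def posterior_theta(p_x, x_obs=1, y_obs=1):
    """Posterior over a parameter theta in a toy discrete model where
    the input x is always observed. p_x is the prior over x."""
    thetas = [0.2, 0.8]                 # two candidate parameter values
    prior_theta = {0.2: 0.5, 0.8: 0.5}  # uniform prior on theta
    # Likelihood: y ~ Bernoulli(theta) if x == 1, else Bernoulli(0.5).
    def lik(theta, x, y):
        p = theta if x == 1 else 0.5
        return p if y == 1 else 1 - p
    # Joint: gamma(x, theta, y) = p(x) p(theta) p(y | x, theta).
    unnorm = {th: p_x[x_obs] * prior_theta[th] * lik(th, x_obs, y_obs)
              for th in thetas}
    z = sum(unnorm.values())
    # p(x_obs) appears in both numerator and normalizer, so it cancels.
    return {th: v / z for th, v in unnorm.items()}

# Two very different priors on the input give identical posteriors:
post1 = posterior_theta({0: 0.9, 1: 0.1})
post2 = posterior_theta({0: 0.1, 1: 0.9})
assert all(abs(post1[th] - post2[th]) < 1e-12 for th in post1)
```

The input prior multiplies both the unnormalized joint and its normalizing constant, so it drops out of the posterior, exactly as in the chain-rule argument above.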

- In what practical situation does this shortcut not work?

- Recall: first objective is to predict failure at 31F

- Compute the posterior mean of the slope and intercept, \(\hat a_\text{mean}\) and \(\hat b_\text{mean}\), and return \(\text{logistic}(31 \hat a_\text{mean}+ \hat b_\text{mean})\)
- Compute the MAP of the slope and intercept, \(\hat a_\text{MAP}\) and \(\hat b_\text{MAP}\), and return \(\text{logistic}(31 \hat a_\text{MAP}+ \hat b_\text{MAP})\)
- Compute the posterior median of the slope and intercept, \(\hat a_\text{med}\) and \(\hat b_\text{med}\), and return \(\text{logistic}(31 \hat a_\text{med}+ \hat b_\text{med})\)
- None of the above

- All of these options are examples of the “plug-in approach” and are non-Bayesian
- Frequentist “plug-in approach”
- Find a point estimate on parameters, e.g. slope and intercept \(\hat a\) and \(\hat b\)
- Plug-in into the arguments of the likelihood function of the datapoint to predict, e.g. \(\text{logistic}(31 \hat a+ \hat b)\)

- Bayesian:
- Do not plug-in! (except in some rare situation where it coincides with the strategy below)
- Instead: apply the Bayes estimator!
- Introduce the variable you want to predict into the model
- Use the posterior over that variable
- Introduce a loss function for this prediction

- Here \(z = (\text{slope}, \text{intercept}, y)\), where \(y\in\{0,1\}\) denotes the outcome of the next launch, i.e. a single binary variable we seek to predict.
- Let \(a \in \{0, 1\}\) denote our action (prediction).
- Let us pick the loss \(L(a, z) = {{\bf 1}}[y \neq a]\).

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[Y \neq a] | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}(Y \neq a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ 1 - {\mathbb{P}}(Y = a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmax}}\{ {\mathbb{P}}(Y = a | X) : a \in {\mathcal{A}}\} \\ &= \text{MAP estimator} \end{aligned} \]

So we obtain a MAP estimator, but on the predicted binary variable directly (contrast with the MLE plug-in approach, which maximizes continuous parameters first, then plug-in these estimates to predict)
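With MCMC output, this Bayes estimator is trivial to compute: under 0-1 loss it is the marginal posterior mode of the predicted variable, i.e. a majority vote over the posterior samples. A minimal Python sketch (the sample values are hypothetical, for illustration only):

```python
from collections import Counter

def bayes_estimate_01(samples):
    """Bayes estimator under 0-1 loss: the most frequent value among
    posterior samples of the predicted variable (its marginal mode)."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical posterior samples of predictedBinary:
predicted_binary_samples = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
print(bayes_estimate_01(predicted_binary_samples))  # majority vote -> 1
```

Note that the vote is over samples of the *predicted binary variable*, each generated from its own sampled slope and intercept, so parameter uncertainty is averaged over rather than plugged in.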

- Look at the data type of the output variable; this will guide the choice of likelihood model
- Are you trying to predict..
- a real number? Replace the Bernoulli by a normal (or better: fat tail distribution such as t distribution, for robustness to outliers)
- an integer? Replace Bernoulli by Poisson (or Negative Binomial, or other integer valued distribution)
- etc

- General pattern:

\[{\text{output}}| {\text{inputs}}, {\text{parameter}}\sim \text{SuitableDistribution}(f({\text{parameter}}, {\text{inputs}}))\]

- \(f\) should map to the range of parameters permitted by your “SuitableDistribution”
- known in the GLM literature as the inverse of the **link function**
- often a composition of a linear function with a non-linear “squashing” function if the output is constrained
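The pattern above can be sketched in Python for the three data types just mentioned (the dictionary and variable names are mine, not standard GLM terminology): each inverse link maps the linear predictor into the legal parameter range of its distribution.

```python
import math

# output ~ SuitableDistribution(f(parameter, inputs))
# f is the inverse link: it maps the unconstrained linear predictor
# eta = intercept + slope * input into the distribution's parameter range.
inv_links = {
    "bernoulli": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # logit link -> (0, 1)
    "poisson":   lambda eta: math.exp(eta),                 # log link   -> (0, inf)
    "normal":    lambda eta: eta,                           # identity   -> all reals
}

eta = 0.7  # a made-up value of the linear predictor
for name, f in inv_links.items():
    print(name, round(f(eta), 3))
```

The logistic regression above is exactly this pattern with the Bernoulli row: `dbern(ilogit(intercept + slope * temperature[i]))`.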

- Sometimes, the end goal is not to predict but rather to select a scientific model amongst candidate models
- Example:
- Model 1: the temperature has no effect on the probability of failure
- Model 2: the model we developed today

- Notice: if we set the slope parameter to zero, model 2 recovers model 1 as a special case
- Loss function for model selection?
- Bayes estimator?

The Bayes estimator is

\[{\textrm{argmax}}\{{\mathbb{P}}(\text{slope}=a|X) : a \in \{0, 1\}\}\]

- Yes, this is the optimal approach
- No, instead I would look if the credible intervals on the slope contains 0
- No, because this would always select the more complex model

- Under the prior, \({\mathbb{P}}(\text{slope}=0) = 0\)
- Note:

\[{\mathbb{P}}(A) = 0 \Longrightarrow {\mathbb{P}}(A | B) = \frac{{\mathbb{P}}(A \cap B)}{{\mathbb{P}}(B)} \le \frac{{\mathbb{P}}(A)}{{\mathbb{P}}(B)} = 0 \]

- Hence the simpler model will never be selected!
- The answer based on credible intervals in the last poll is also incorrect: another example of plug-in!

**Mixture** of a continuous prior (for the non-zero slope model) and a point mass at zero. Can be defined as the prior satisfying the following equation for all \(a < b\): \[{\mathbb{P}}(\text{slope} \in [a, b]) = p {{\bf 1}}[0 \in [a, b]] + (1-p) \int_a^b \phi(x) {\text{d}}x,\] where:

- \(p\in [0, 1]\) is an additional parameter which controls the prior belief in the simpler model, \({\mathbb{P}}(\text{slope}=0) = p\)
- \(\phi\) is the density of the continuous part of the prior (e.g., normal)
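A quick Monte Carlo sketch in Python of drawing from such a prior (my own illustration; the standard-normal slab, \(p = 0.5\), and the sample size are arbitrary choices):

```python
import random

def sample_spike_and_slab(p, n, seed=1):
    """Draw n values from a spike-and-slab prior: with probability p,
    exactly 0 (the spike); otherwise a standard normal draw (the slab)."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else rng.gauss(0.0, 1.0)
            for _ in range(n)]

draws = sample_spike_and_slab(p=0.5, n=100000)
frac_zero = sum(d == 0.0 for d in draws) / len(draws)
print(frac_zero)  # close to p = 0.5, so P(slope = 0) > 0 under this prior
```

Unlike the purely continuous normal prior used earlier, the event \(\{\text{slope} = 0\}\) now has positive prior probability, so its posterior probability can also be positive.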

- After this change, the Bayes estimator will behave well
- Lesson: if there are posterior probabilities in the Bayes estimator, choose your prior so as to ensure that the corresponding prior probabilities are positive

Start reading about one or two PPLs, e.g.:

- Blang, developed in my group, which efficiently supports discrete, combinatorial, and continuous random variables; especially useful for hard non-convex problems. Sections 1 to 6 will be most relevant at this point, but you can have a peek at 7, 10, 11.
- Probabilistic Programming and Bayesian Methods for Hackers, by Cameron Davidson-Pilon
- Probabilistic programming book giving a programmer-centric view of Bayesian analysis
- Many lively examples
- Free ebook: http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/

- Frameworks focussing on autodiff+HMC, e.g. for continuous random variables and nice log-concave targets: Stan, Edward, PyMC3, Pyro (some support discrete variables with varying degrees of efficiency).
- Languages focussing on maximum expressivity: Anglican, Birch

```
model.name <- "spike-slab.bugs"
{ # Extra bracket needed only for R markdown files
sink(model.name)
cat("
model {
  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)
  selected ~ dbern(0.5)
  for (i in 1:length(temperature)) {
    # Equivalent explicit form: 1.0 / (1.0 + exp(- intercept - selected * slope * temperature[i]))
    p[i] <- ilogit(intercept + selected * slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }
  # Posterior predictive probability of failure at 31F:
  predictedPr <- ilogit(intercept + selected * slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
", fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files

require("rjags")
require("coda")
require("ggplot2")

data <- read.csv(file = "challenger_data-train.csv", header = TRUE)

# Make the simulations reproducible.
# The seed 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  model.name,
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident,
    'temperature' = data$Temperature),
  inits = inits)
```

```
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 23
## Unobserved stochastic nodes: 5
## Total graph size: 113
##
## Initializing model
```

```
samples <-
  coda.samples(model,
               # Variables to monitor (plot, summarize, etc.):
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'),
               100000) # number of MCMC iterations
summary(samples)
```

```
##
## Iterations = 1001:101000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## intercept 3.0611 8.8648 0.0280329 1.58688
## predictedBinary 0.4461 0.4971 0.0015719 0.02981
## predictedPr 0.4450 0.2887 0.0009129 0.07922
## slope -0.0205 28.1810 0.0891162 0.11244
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## intercept -1.7724 -1.0794 -0.6995 -0.1377 28.65
## predictedBinary 0.0000 0.0000 0.0000 1.0000 1.00
## predictedPr 0.1452 0.2536 0.3319 0.4651 1.00
## slope -59.1470 -15.1741 -0.2526 15.2250 58.56
```

`plot(samples)`