Alexandre Bouchard-Côté
Context:
Data:
data <- read.csv("challenger_data-train.csv")
knitr::kable(data, floating.environment="sidewaystable")
Date | Temperature | Damage.Incident |
---|---|---|
04/12/1981 | 66 | 0 |
11/12/1981 | 70 | 1 |
3/22/82 | 69 | 0 |
6/27/82 | 80 | NA |
01/11/1982 | 68 | 0 |
04/04/1983 | 67 | 0 |
6/18/83 | 72 | 0 |
8/30/83 | 73 | 0 |
11/28/83 | 70 | 0 |
02/03/1984 | 57 | 1 |
04/06/1984 | 63 | 1 |
8/30/84 | 70 | 1 |
10/05/1984 | 78 | 0 |
11/08/1984 | 67 | 0 |
1/24/85 | 53 | 1 |
04/12/1985 | 67 | 0 |
4/29/85 | 75 | 0 |
6/17/85 | 70 | 0 |
7/29/85 | 81 | 0 |
8/27/85 | 76 | 0 |
10/03/1985 | 79 | 0 |
10/30/85 | 75 | 1 |
11/26/85 | 76 | 0 |
01/12/1986 | 58 | 1 |
Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb
Regression models (Bayesian and non-Bayesian) involve the following quantities:
Model: logistic regression. Input: temperature. Output: failure indicator (i.e. binary) variable.
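The `ilogit` link used in the BUGS model below can be written out in plain R. A minimal sketch with hypothetical parameter values (not estimates from the data): a negative slope means colder launches get higher failure probabilities.

```r
# Inverse-logit (the 'ilogit' function used in the BUGS model below),
# written out in plain R.
ilogit <- function(x) 1 / (1 + exp(-x))

# Hypothetical parameter values, for illustration only:
intercept <- 15
slope <- -0.23

p_cold <- ilogit(intercept + slope * 31)  # launch-day temperature (31 F)
p_warm <- ilogit(intercept + slope * 75)  # a typical warm launch

# With a negative slope, the colder launch has the higher failure probability:
p_cold > p_warm  # TRUE
```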
Questions:
model {
slope ~ dnorm(0, 0.001)
intercept ~ dnorm(0, 0.001)
for (i in 1:length(temperature)) {
# incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
p[i] <- ilogit(intercept + slope * temperature[i])
incident[i] ~ dbern(p[i])
}
# predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
predictedPr <- ilogit(intercept + slope * 31)
predictedBinary ~ dbern(predictedPr)
}
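The `jags.model()` call below reads this model from a file named `model.bugs`. One way to create that file (an alternative to the `sink()`/`cat()` pattern used later for the spike-and-slab model) is:

```r
# Write the logistic regression model to 'model.bugs' so that
# jags.model('model.bugs', ...) can find it on disk.
model.string <- "
model {
  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)
  for (i in 1:length(temperature)) {
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
"
writeLines(model.string, "model.bugs")
```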
require("rjags")
require("coda")
require("ggplot2")
data <- read.csv( file = "challenger_data-train.csv", header = TRUE)
# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
'model.bugs',
data = list( # Pass on the data:
'incident' = data$Damage.Incident,
'temperature' = data$Temperature),
inits=inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 23
## Unobserved stochastic nodes: 4
## Total graph size: 111
##
## Initializing model
samples <-
coda.samples(model,
c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
100000) # number of MCMC iterations
summary(samples)
##
## Iterations = 1001:101000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## intercept 19.8246 9.25184 0.0292569 0.803987
## predictedBinary 0.9930 0.08361 0.0002644 0.001392
## predictedPr 0.9925 0.04013 0.0001269 0.001341
## slope -0.3032 0.13592 0.0004298 0.011746
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## intercept 5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary 1.0000 1.0000 1.0000 1.0000 1.00000
## predictedPr 0.9329 0.9989 0.9999 1.0000 1.00000
## slope -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
plot(samples)
print(HPDinterval(samples))
## [[1]]
## lower upper
## intercept 4.0054009 38.42484799
## predictedBinary 1.0000000 1.00000000
## predictedPr 0.9723887 1.00000000
## slope -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[, "intercept"]  # index columns by name rather than position
slope <- chain[, "slope"]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
## intercept slope
## Min. :-2.853 Min. :-0.98311
## 1st Qu.:13.195 1st Qu.:-0.38156
## Median :18.452 Median :-0.28303
## Mean :19.825 Mean :-0.30323
## 3rd Qu.:25.153 3rd Qu.:-0.20573
## Max. :66.041 Max. : 0.02563
ggplot(df, aes(intercept, slope)) +
geom_density_2d()
For a more modern way of working with JAGS output, see tidybayes: http://mjskay.github.io/tidybayes/
Recall our Bayes estimation framework from last week:
$$\delta^*(X) = \operatorname{argmin}\{E[L(a,Z)\mid X] : a \in A\}$$

With the 0-1 loss $L(a, Y) = \mathbf{1}[Y \neq a]$:

$$\begin{aligned}
\delta^*(X) &= \operatorname{argmin}\{E[L(a,Z)\mid X] : a \in A\} \\
&= \operatorname{argmin}\{E[\mathbf{1}[Y \neq a]\mid X] : a \in A\} \\
&= \operatorname{argmin}\{P(Y \neq a\mid X) : a \in A\} \\
&= \operatorname{argmin}\{1 - P(Y = a\mid X) : a \in A\} \\
&= \operatorname{argmax}\{P(Y = a\mid X) : a \in A\} \\
&= \text{MAP estimator}
\end{aligned}$$
So we obtain a MAP estimator, but on the predicted binary variable directly (contrast this with the MLE plug-in approach, which first maximizes over the continuous parameters, then plugs these estimates in to predict).
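From MCMC output, the MAP estimate of the predicted binary variable under 0-1 loss is simply a majority vote over the monitored samples. A minimal sketch, with made-up draws standing in for the actual `predictedBinary` chain:

```r
# MAP estimate of a binary variable from posterior samples: under 0-1 loss,
# pick the value with the highest posterior probability, i.e. a majority vote.
# These draws are made up; in practice use the monitored 'predictedBinary' chain.
predicted.binary.samples <- c(1, 1, 0, 1, 1, 0, 1, 1, 1, 1)
map.estimate <- as.integer(mean(predicted.binary.samples) > 0.5)
map.estimate  # 1: failure is the more probable outcome for these draws
```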
$$\text{output} \mid \text{inputs}, \text{parameter} \sim \text{SuitableDistribution}(f(\text{parameter}, \text{inputs}))$$
The Bayes estimator is
$$\operatorname{argmax}\{P(g(\text{slope}) = a \mid X) : a \in \{0,1\}\}, \qquad g(x) = \mathbf{1}_{\{0\}}(x)$$
$$P(A) = 0 \implies P(A \mid B) = \frac{P(A \cap B)}{P(B)} \le \frac{P(A)}{P(B)} = 0$$
Start reading about one or two PPLs, e.g.:
model.name <- "spike-slab.bugs"
{ # Extra bracket needed only for R markdown files
sink(model.name)
cat("
model {
slope ~ dnorm(0, 0.001)
intercept ~ dnorm(0, 0.001)
selected ~ dbern(0.5)
for (i in 1:length(temperature)) {
# incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
p[i] <- ilogit(intercept + selected * slope * temperature[i])
incident[i] ~ dbern(p[i])
}
# predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
predictedPr <- ilogit(intercept + selected * slope * 31)
predictedBinary ~ dbern(predictedPr)
}
",fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files
require("rjags")
require("coda")
require("ggplot2")
data <- read.csv( file = "challenger_data-train.csv", header = TRUE)
# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
model.name,
data = list( # Pass on the data:
'incident' = data$Damage.Incident,
'temperature' = data$Temperature),
inits=inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 23
## Unobserved stochastic nodes: 5
## Total graph size: 113
##
## Initializing model
samples <-
coda.samples(model,
c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
100000) # number of MCMC iterations
summary(samples)
##
## Iterations = 1001:101000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## intercept 3.0611 8.8648 0.0280329 1.58688
## predictedBinary 0.4461 0.4971 0.0015719 0.02981
## predictedPr 0.4450 0.2887 0.0009129 0.07922
## slope -0.0205 28.1810 0.0891162 0.11244
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## intercept -1.7724 -1.0794 -0.6995 -0.1377 28.65
## predictedBinary 0.0000 0.0000 0.0000 1.0000 1.00
## predictedPr 0.1452 0.2536 0.3319 0.4651 1.00
## slope -59.1470 -15.1741 -0.2526 15.2250 58.56
plot(samples)
print(HPDinterval(samples))
## [[1]]
## lower upper
## intercept -2.3624604 24.19539
## predictedBinary 0.0000000 1.00000
## predictedPr 0.1685827 1.00000
## slope -60.0940038 57.54338
## attr(,"Probability")
## [1] 0.95
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[, "intercept"]  # index columns by name rather than position
slope <- chain[, "slope"]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
## intercept slope
## Min. :-3.4135 Min. :-127.3777
## 1st Qu.:-1.0794 1st Qu.: -15.1741
## Median :-0.6995 Median : -0.2526
## Mean : 3.0611 Mean : -0.0205
## 3rd Qu.:-0.1377 3rd Qu.: 15.2250
## Max. :57.1284 Max. : 131.9987
ggplot(df, aes(intercept, slope)) +
geom_density_2d()
First, review the concepts in this topic by going over this exercise set: Regression.html
Second, as an optional extra problem, consider the following alternative approach to the Challenger data: a change-point model.
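One possible shape for such a change-point model, sketched in the same BUGS dialect as the models above. The uniform priors and the `ifelse()` construction are one arbitrary choice among many, not a worked solution to the exercise:

```
model {
  threshold ~ dunif(50, 85)   # unknown change-point temperature
  pLow ~ dunif(0, 1)          # failure probability below the threshold
  pHigh ~ dunif(0, 1)         # failure probability at or above the threshold
  for (i in 1:length(temperature)) {
    p[i] <- ifelse(temperature[i] < threshold, pLow, pHigh)
    incident[i] ~ dbern(p[i])
  }
}
```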
results/latest/logNormalizationEstimate.csv
an estimate of the quantity log p(data), i.e. the logarithm of the marginal density of the data (use the one obtained from the "stepping stone" estimator; it is also shown in the standard output printed while the program runs). As we will see, this quantity is useful for comparing models: a higher marginal likelihood means the model is "less surprised by the data". Use this criterion to decide which of the models we considered seems more adequate.
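Given the log marginal likelihood estimates for two models, their difference is a log Bayes factor. A sketch of the comparison, with placeholder numbers standing in for the stepping-stone estimates read from each run's logNormalizationEstimate.csv:

```r
# Compare two models via their estimated log marginal likelihoods.
# These values are placeholders, not actual results -- substitute the
# stepping-stone estimates from each model's logNormalizationEstimate.csv.
logZ.logistic  <- -12.3  # hypothetical estimate, logistic model
logZ.spikeslab <- -14.1  # hypothetical estimate, spike-and-slab model

log.bayes.factor <- logZ.logistic - logZ.spikeslab
# Positive => the data are less "surprising" under the logistic model.
log.bayes.factor > 0
```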