STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/6/2019
Goals
Today:
- Finishing up our analysis of a very simple model
- Calibration, continued
- Bernstein von-Mises
- de Finetti
- Into more interesting models
- PPL/MCMC as a black box
- Bayesian GLMs
- AB testing
“Homework”: software setup! Be ready to use it next week.
Next week: Bayesian model building
Recall: First way to break consistency: violating Cromwell’s rule
- Suppose we put zero prior mass on having a fair die. What happens to the posterior?
If we put prior mass of zero on some event, say \(\pi_0(A) = 0\), then the posterior probability of that event will always be zero, \(\pi_n(A) = 0\), no matter how many observations \(n\) we collect
Cromwell’s rule (Oliver Cromwell, 1650): “I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”
Recall: Second way to break consistency: Unidentifiable models
When there are too many parameters, you may not be able to recover them even if you had “infinite data”
- Example:
- Let us modify the prior for the rocket example (continuous version), setting \(p = p_1 p_2\) where \(p_1 \sim {\text{Unif}}(0,1)\) and \(p_2 \sim {\text{Unif}}(0,1)\).
- Intuition: distinct pairs \((p_1, p_2)\) can yield the same product \(p = p_1 p_2\), hence the same likelihood, e.g. \((p_1, p_2) = (0.1, 0.2)\) and \((p_1, p_2) = (0.05, 0.4)\) both give \(p = 0.02\).
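A quick numerical check of the intuition above (a sketch in R; the launch count is an arbitrary choice):
n_launches <- 10
n_fails <- 0:n_launches
lik1 <- dbinom(n_fails, n_launches, 0.1 * 0.2)  # (p1, p2) = (0.1, 0.2), so p = 0.02
lik2 <- dbinom(n_fails, n_launches, 0.05 * 0.4) # (p1, p2) = (0.05, 0.4), also p = 0.02
all.equal(lik1, lik2) # TRUE: the data cannot distinguish the two pairs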
Model code in the Blang PPL
model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}
- Observations:
- The posterior does not become degenerate as the sample size increases
- It is now asymmetric. This is not a bug! Can you figure out why?
This is an example of what is called an unidentifiable model: there are two distinct parameters, \(\theta_1\) and \(\theta_2\), with \(\theta_1 \neq \theta_2\), yet the likelihoods are exactly equal, \({\mathbb{P}}_{\theta_1} = {\mathbb{P}}_{\theta_2}\).
Note: Paul Gustafson is the expert on Bayesian unidentifiable / weakly identifiable models
Recall: Theoretical justification of consistency: Doob’s consistency theorem
In the exchangeable setup, outside of these two failure modes, we are guaranteed to have consistency. The statement of the theorem is somewhat delicate, and its details matter for interpreting the result correctly.
- Frequentist notions of consistency under \(\theta\):
- setup:
- a parametric model, \(\{{\mathbb{P}}_\theta:\theta \in \Theta\}\), defined on space \(\Omega\)
- observed random variables \(X_i : \Omega \to {\mathbf{R}}^d\)
- estimator \(T_n(x_1, x_2, \dots, x_n)\) for \(\theta\)
- consistency:
- compose the observation r.v. with the estimator: \(T_n = T_n(X_1, X_2, \dots, X_n)\)
- ask that \(T_n \to \theta\) as \(n \to \infty\), under a suitable notion of convergence of random variables…
- either “in \({\mathbb{P}}_\theta\)-probability” or “\({\mathbb{P}}_\theta\) almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all \(\epsilon > 0\), \({\mathbb{P}}_\theta(d(T_n, \theta) < \epsilon) \to 1\) for some metric \(d\)
- almost sure: \({\mathbb{P}}_\theta(T_n \to \theta) = 1\)
- frequentists ask that the above holds for all \(\theta\)
- Bayesian notion of consistency under \(\theta\)
- setup:
- assume the “exchangeable setup”
- let \(\Pi_n(\cdot) = \Pi_n(\cdot | X_1, X_2, \dots, X_n)\) denote the (random) posterior (why random: as for the frequentist case, composition of the observations random variable with the posterior update map)
- ask that \(\Pi_n(A) \to \delta_\theta(A)\) for any \(A\), under a suitable notion of convergence of random variables…
- either “in \({\mathbb{P}}_\theta\)-probability” or “\({\mathbb{P}}_\theta\) almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all \(\epsilon > 0\), \({\mathbb{P}}_\theta(|\Pi_n(A) - \delta_\theta(A)| < \epsilon) \to 1\)
- almost sure: \({\mathbb{P}}_\theta(\Pi_n(A) \to \delta_\theta(A)) = 1\)
- Bayesians ask that the above holds for a set of “good” \(\theta\)’s, denoted \(G\), such that this good set has probability one under the prior: \(\pi_0(G) = 1\).
Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.
Corollary: if you use the wrong prior, \(\tilde \pi_0\), but \(\tilde \pi_0(A) > 0 \Longleftrightarrow \pi_0(A) > 0\), then Doob’s consistency still holds for identifiable exchangeable models.
Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, you will get right anything that’s derived from it)
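To see Doob’s theorem in action, here is a minimal simulation sketch in R (a conjugate Beta-Binomial stand-in for the rocket example; the true parameter and the window width are arbitrary choices):
set.seed(1)
true_p <- 0.3
for (n in c(10, 100, 1000, 10000)) {
  x <- rbinom(1, n, true_p)
  # Uniform prior = Beta(1, 1), so the posterior is Beta(1 + x, 1 + n - x)
  mass_near_truth <- pbeta(true_p + 0.05, 1 + x, 1 + n - x) -
    pbeta(true_p - 0.05, 1 + x, 1 + n - x)
  cat(n, ":", round(mass_near_truth, 3), "\n")
}
The posterior mass within 0.05 of the truth approaches one as \(n\) grows, i.e. the posterior concentrates on the true parameter.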
Recall: calibration of credible intervals
- Recall: notion of prediction calibration, “being right 90% of the time when you say you are 90% sure”
- Today: credible interval calibration
- a calibrated 90% credible interval should contain the true parameter 90% of the time
- Let’s see if this is true using simulated data!
- Let’s check on 1000 small datasets (10 rocket launches each)
- Let’s check on 1000 medium datasets (100 launches each)
- Let’s check on 1000 large datasets (1000 launches each)
- What do you expect?
set.seed(1)
K <- 1000
prior_used_for_computing_posterior <- rep(1/(K+1), K+1) # Use same prior for simulation and posterior computation
n_repeats <- 1000
alpha <- 0.1
hdi_coverage_pr <- function(n_datapoints) {
  n_inclusions <- 0
  for (repetition in 1:n_repeats) {
    # rdunif, posterior_distribution and high_density_intervals are helper
    # functions assumed defined in an earlier lecture
    i <- rdunif(K + 1) - 1 # Always generate the data using a uniform prior
    true_p <- i/K
    x <- rbinom(1, n_datapoints, true_p)
    post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
    # This if is just a hacky way to check if the true parameter is in the HDI credible interval
    if (sum(abs(true_p - high_density_intervals(alpha, post)[1,]) < 10e-10) == 1) {
      n_inclusions <- n_inclusions + 1
    }
  }
  return(n_inclusions/n_repeats) # Fraction of simulations where the true parameter was in the interval
}
df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)
ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
ylim(0, 1) +
geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
geom_line()
- Surprise!: in a Bayesian well-specified context, calibration holds for all dataset sizes!
- Contrast with typical frequentist intervals, which are usually calibrated only for large enough datasets
- This is because frequentist intervals typically need to rely on asymptotics (the Central Limit Theorem) to be computable
- Bonus 1: this is a great way to test correctness of your software implementation.
- Correct Bayesian inference code: calibrated.
- Buggy Bayesian inference code: almost never calibrated.
- This is convenient because you can run the test on small datasets where inference is fast.
- Pick a test just big enough to get code coverage.
- Bonus 2: could you use this to perform model “goodness-of-fit”? How?
- Harness the fact that for a Bayesian, prediction and parameter estimation are done in the same framework
- First, let us pick a simple setup, say iid univariate observations \(y = (y_1, y_2, \dots, y_n)\)
- Let \(y_{-i}\) denote all the observations excluding the \(i\)-th one, \(y_{-i} = (y_1, y_2, \dots, y_{i-1}, y_{i+1}, \dots, y_n)\)
- Idea: if the model is well-specified, \({\mathbb{P}}(Y_{i} \in C_\alpha(Y_{-i})) = \alpha\)
- Approximate the LHS using leave-one-out: \[{\mathbb{P}}(Y_i \in C_\alpha(Y_{-i})) \approx \frac{1}{n} \sum_i I[y_{i} \in C_\alpha(y_{-i})]\] where \(I[\cdot]\) denotes the indicator function of an event, i.e. a function that is 1 if the outcome is in the event and 0 otherwise (a code sketch follows at the end of this section)
- Special case: the cross-validated probability integral transform (PIT); see this paper for practical methods to compute it using importance sampling and other tricks. It uses simple credible intervals of the form \(C_\alpha = (-\infty, x]\) (my hunch: it might be better to use an HDI; this makes no difference if the model is well-specified, but it might in the misspecified case, which matters since we are doing goodness-of-fit!)
- PIT is related to the more commonly used but less formal “posterior predictive check”, where data point \(i\) is not excluded
- easier to compute (only one posterior instead of \(n\) posteriors)
- no theoretical guarantees and hence harder to interpret
- useful as a quick and dirty visualization trick
- Summary:
- People often use “synthetic data” as follows:
- Generate data, hold out the true parameter \(\theta^\star\)
- Report distance between inferred \(\theta\) and \(\theta^\star\)
- But how do you interpret this distance? It can be unclear.
- Message here: there are other ways to use synthetic data in a Bayesian context!
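Here is a minimal sketch in R of the leave-one-out calibration check described above, for a toy conjugate model (Normal observations with known variance 1 and a standard Normal prior on the mean; all the modelling choices are just for illustration):
set.seed(1)
n <- 50
alpha <- 0.1
mu_true <- rnorm(1) # draw the true mean from the prior (well-specified case)
y <- rnorm(n, mu_true, 1)
covered <- 0
for (i in 1:n) {
  y_rest <- y[-i]
  # Conjugate posterior for the mean given y[-i]
  post_var <- 1 / (1 + length(y_rest))
  post_mean <- sum(y_rest) * post_var
  # Posterior predictive for y[i]: Normal(post_mean, post_var + 1)
  interval <- qnorm(c(alpha/2, 1 - alpha/2), post_mean, sqrt(post_var + 1))
  if (y[i] > interval[1] && y[i] < interval[2]) covered <- covered + 1
}
covered / n # should be close to 1 - alpha
Averaging this coverage over many replications should match the nominal level when the model is well-specified.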
Calibration of credible intervals (misspecified case)
Now suppose we are using the “wrong prior”, i.e. data generation uses a uniform prior but we base our posterior computation on the non-uniform prior from earlier
Again, let’s do it for small (10 launches), medium (100), and large datasets (1000), plotting the nominal coverage (dashed) against the actual coverage (solid line)
set.seed(1)
K <- 1000
n_repeats <- 1000
alpha <- 0.1
# Now use the same non-uniform prior as before for the posterior calculation
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K), mean = 0.2, sd = 0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)
ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
ylim(0, 1) +
geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
geom_line()
- Bad news: for small datasets we are no longer calibrated, and in the worst possible way
- Higher than the dashed line would have meant inference is conservative, i.e. more right than it actually claims. That’s not too bad.
- Lower than the dashed line means we are overconfident, or anti-conservative, which in some cases can be reckless
- Good news: this gets corrected quickly as the dataset gets larger. Why?
Asymptotic calibration under certain misspecification
- There is no general calibration theory for misspecified models, only special cases (outside of that, use simulation studies or cross-validation techniques)
- Setup in which we have a theorem: graphical model of the following form
Bernstein-von Mises: under certain conditions, even when the prior is misspecified, the actual coverage of credible intervals converges to the nominal coverage (here “converges” just means “gets arbitrarily close”).
Conditions include, but are not limited to:
- \(\theta \in \Theta\) needs to live in a continuous rather than discrete space! I.e.: \(\Theta \subset {\mathbf{R}}^d\)
- Intuition: Bernstein-von Mises actually proves something stronger: convergence of the rescaled, centered posterior to a normal distribution, which is a continuous distribution (can you guess the scaling?)
- “Bayesian Central Limit Theorem”
- \(p(\theta)\) (the prior) is now a density instead of a PMF
- we assume it is non-zero in a neighborhood of the true parameter
- we assume the posterior mean is consistent
- recall: that means the error goes to zero as number of data points increases
- we assume the model is parametric:
- \(\theta\) is allowed to be a vector, but you are not allowed to make it bigger as the number of observations increases
We will explore these conditions in more detail.
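For reference, one standard way to state the result (a sketch; \(\hat\theta_n\) denotes an efficient estimator such as the MLE, \(I(\theta_0)\) the Fisher information at the true parameter, and the distance is total variation):
\[\left\| \Pi_n - {\mathcal{N}}\left(\hat\theta_n, \frac{1}{n} I(\theta_0)^{-1}\right) \right\|_{\text{TV}} \to 0 \quad \text{in } {\mathbb{P}}_{\theta_0}\text{-probability}\]
Equivalently, the posterior distribution of \(\sqrt{n}(\theta - \hat\theta_n)\) converges to a fixed normal distribution: the \(\sqrt{n}\) rescaling is the answer to the scaling question above.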
Survey: who is familiar with asymptotic normality of the MLE?
Practical consequence:
- for large datasets, and under certain conditions, the prior is “washed away” by the likelihood
- The conditions are not crazy and hold in some situations of interest
- But there are also simple practical situations where the conditions do not hold (e.g. prior assigning zero mass to important regions)
De Finetti: intuition and motivation
- We have invoked the “exchangeable/iid” setup several times already
- De Finetti theorem will give us more insight in this class of models
- Motivation:
- Often, we may not want to rely on the order of the rows in the spreadsheet given to us. Just shuffle it to be safe.
- In such circumstances, de Finetti says that there exists a well-specified model with the graphical model on the right
- I.e. that there exists a prior such that the data is iid given that prior
- This motivation gives you an idea of the theorem, but it is not exactly what de Finetti says; let’s formalize the theorem!
Exchangeable random variables
- Recall: notion of equality in distribution \(X_1 {\stackrel{\scriptscriptstyle d}{=}}X_2\)
- …this means the distribution of \(X_1\) is equal to the distribution of \(X_2\)
- …concretely: draw the marginal PMF or density of each random variable, check if they match
- Example where \(X_1 {\stackrel{\scriptscriptstyle d}{=}}X_2\) but \(X_1 \neq X_2\)?
- Highlight to reveal the answer: You flip a coin and report what is on the top, \(X_1\). Your friend reports what is on the bottom, \(X_2\).
- Two random variables are exchangeable if \((X_1, X_2) {\stackrel{\scriptscriptstyle d}{=}}(X_2, X_1)\)
- Example: uniform distributions (i.e. such that the bivariate density is constant on a 2d shape).
- Develop a criterion for checking from the joint density if it’s exchangeable.
- Highlight to reveal the answer: It should be symmetric with respect to the line \(y = x\)
- Show that the indicator on (a) a square, (b) a circle, and (c) a checkerboard, all centered at zero, leads to exchangeable random variables. Which one is iid?
- Highlight to reveal the answer: Only the square is iid
- Note: exchangeability implies identical distributions
- Note: this notion is closely related to reversibility
- Generalization: \(n\) variables are exchangeable if for all permutations \(\sigma \in S_n\) (bijections \(\sigma : \{1, 2, \dots, n\} \to \{1, 2, \dots, n\}\)), \((X_1, X_2, \dots, X_n) {\stackrel{\scriptscriptstyle d}{=}}(X_{\sigma(1)}, X_{\sigma(2)}, \dots, X_{\sigma(n)})\)
- Infinite exchangeability: an infinite sequence of random variables \((X_1, X_2, \dots)\) is exchangeable if all finite subsets are exchangeable.
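A quick simulation sketch in R illustrating the circle case above (uniform on a disk is exchangeable by symmetry, but not iid since the coordinates constrain each other):
set.seed(1)
pts <- matrix(runif(2 * 100000, -1, 1), ncol = 2) # uniform points on the square
pts <- pts[rowSums(pts^2) <= 1, ]                 # keep the points inside the unit disk
x1 <- pts[, 1]; x2 <- pts[, 2]
mean(abs(x2) > 0.8)                 # unconditional probability
mean(abs(x2[abs(x1) > 0.8]) > 0.8)  # conditional on X1 near the edge: exactly 0 here
If the coordinates were independent, the two numbers would match; instead, \(|X_1| > 0.8\) forces \(|X_2| \le 0.6\) on the disk.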
Continuous random variables: motivation
- Example: \(p \in [0, 1]\) instead of \(p \in \{0, 1/K, 2/K, \dots, 1\}\)
- Motivations to having continuous random variables:
- Discretization becomes exponentially expensive (in the number of dimensions)
- We want Bernstein-von Mises to kick in
- Annoying to select a discretization size (\(K\) in our rocket example)
Continuous random variables: first implementation
- Question: how to store the posterior distribution of a continuous random variable in a computer?
- With a discrete random variable, we could do it with an array of size \(K\)
- For continuous random variables: use a density (a first answer)
Recall: how to use a density to compute a probability?
- Say a random variable \(X\) has density \(f\)
- I want to know the probability that \(X\) is between say 0.1 and 0.3
- Answer:
\[{\mathbb{P}}(X \in [0.1, 0.3]) = \int_{0.1}^{0.3} f(x) {\text{d}}x\]
- At this point, you should have tears in your eyes: integrals are nasty
- No closed form in general
- Even if we pick nice prior with closed form integrals, posterior integrals might be nasty
- Recall in the rocket example, we observed that even if we start with a uniform prior, the posterior is not uniform
- If \(X = (X_1, X_2, \dots, X_d)\) is a random vector (stacked random variables), we need \(d\) iterated integrals
- Numerical approximation (trapezoid, etc) does not scale to models having many dependent random variables
Monte Carlo as a black box
Motivation:
- we want continuous random variables without having to compute integrals (no closed form in general; numerical integration is too slow beyond ~10 dimensions)
- we want combinatorial objects without computing hard sums
However, for simplicity, let us start by assuming the model has only 1 real valued unknown parameter.
Idea:
- Store a list \(S\) of random points called Monte Carlo samples,
- \(S = (X_1, X_2, \dots, X_M)\),
- chosen such that probabilities can be approximated by:
\[{\mathbb{P}}(X \in [0.1, 0.3]) \approx \frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M}\]
Intuition. Let’s say you want to compute the area of this weird curved shape \(A\):
Calculus: need the following double integral over the weird shape \(A\) \[\int \int_A {\text{d}}x {\text{d}}y\]
- Monte Carlo method:
- Simulate random points in the square \(S = (X_1, X_2, \dots, X_M)\)
- Return the fraction of the points that fall inside the weird shape \(A\)
Note: all we need to do for each point \(X_i\) is to check if the point falls in the region or not. This is called pointwise evaluation. No need to compute complicated integrals.
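Here is that recipe in code (a sketch in R; the weird shape is replaced by a unit disk so we can compare against the known area \(\pi\)):
set.seed(1)
M <- 100000
x <- runif(M, -1, 1); y <- runif(M, -1, 1) # random points in the square [-1, 1]^2
inside <- x^2 + y^2 <= 1                   # pointwise evaluation: is the point in the shape?
4 * mean(inside)                           # (fraction inside) * (area of square) is close to pi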
How to create Monte Carlo samples \(S\)
Guiding principle: we have two ways to compute probabilities: densities and Monte Carlo. We want that, as the size of \(S\) increases, the discrepancy between the two methods becomes arbitrarily close to zero
- Mathematically: we want as \(M \to \infty\), \[\frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M} \to \int_{0.1}^{0.3} f(x) {\text{d}}x\] with “overwhelming probability”.
- This property is called a “Law of Large Numbers”
- There are several ways to get a “Law of Large Numbers” (a fact not emphasized enough!)
- Independent and identically distributed random variables \(X_i\) (the case you may be familiar with, but not the only one!)
- The list of states visited by a Markov chain whose long-run behaviour has density \(f\) (called the stationary density)
- But for now you can still treat \(S\) as magically provided by a PPL black box
Computing a mean
\[{\mathbf{E}}[X] = \int x f(x) {\text{d}}x\]
- Monte Carlo way (much easier to implement):
\[{\mathbf{E}}[X] \approx \frac{\sum_{i=1}^M X_i}{M}\]
Again, Laws of Large Numbers link the two
In our weird-shape example, average each coordinate separately over the points that fall in the shape (this estimates the centroid).
For the variance, use \(\text{Var}[X] = {\mathbf{E}}[X^2] - ({\mathbf{E}}[X])^2\); we already have the second term, and for the first term
\[{\mathbf{E}}[X^2] \approx \frac{\sum_{i=1}^M X^2_i}{M}\]
Makes sense: think about it as defining a new random variable \(Y = X^2\) and computing the mean of that transformed random variable…
Generalizing this trick: if I give you a nice enough function \(g\) such as \(g(x) = x^2\),
\[{\mathbf{E}}[g(X)] \approx \frac{\sum_{i=1}^M g(X_i)}{M}\]
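In R, each of these estimates is a one-liner once the samples are available (a sketch; direct Beta draws stand in for the samples a PPL would hand us):
set.seed(1)
S <- rbeta(100000, 2, 5) # stand-in for Monte Carlo samples from some posterior
mean(S)                  # estimates E[X] (exact value here: 2/7)
mean(S^2) - mean(S)^2    # estimates Var[X] via E[X^2] - (E[X])^2
mean(S > 0.5)            # estimates P(X > 0.5), i.e. g(x) = I[x > 0.5]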
More asymptotic regimes
We have done two thought experiments so far:
- Letting the number of observations (rocket launches) go to infinity, \(N \to \infty\)
- Data collection budget
- Often no control on that
- Letting the number of Monte Carlo samples go to infinity, \(M \to \infty\)
- Computational budget
- If not happy with results, easy to increase
- Pop quiz: is the picture on the bottom right always a point mass?
Note:
- for “nice” models such as this one, as \(N \to \infty\) the posterior gets arbitrarily peaky (“converges to a point mass”).
- Conditions will be stated next week.
- But there are less nice models where we do not get convergence to a point mass as the data size goes to infinity.
- For example, recall the behaviour for unidentifiable models
- These will often be harder to sample from in practice
- People usually invoke “multi-modal” problems to motivate advanced MCMC
- Multi-modality does arise and causes problems, but in my experience the more serious practical problems come from unidentifiable or weakly identifiable models
Lecture ended here
Moving towards more interesting models: challenger disaster and GLMs
Context:
- Challenger space shuttle exploded 73 seconds after launch in 1986
- Cause: O-ring used in the solid rocket booster failed due to low temperature at the time of launch (31 F [-1 C]; all temperatures in Fahrenheit from now on)
- Question investigated: was it unsafe to authorize the launch given the temperature in the morning of the launch?
Data:
data <- read.csv("challenger_data-train.csv")
knitr::kable(data, floating.environment="sidewaystable")
| Date       | Temperature | Damage Incident |
|------------|-------------|-----------------|
| 04/12/1981 | 66          | 0               |
| 11/12/1981 | 70          | 1               |
| 3/22/82    | 69          | 0               |
| 6/27/82    | 80          | NA              |
| 01/11/1982 | 68          | 0               |
| 04/04/1983 | 67          | 0               |
| 6/18/83    | 72          | 0               |
| 8/30/83    | 73          | 0               |
| 11/28/83   | 70          | 0               |
| 02/03/1984 | 57          | 1               |
| 04/06/1984 | 63          | 1               |
| 8/30/84    | 70          | 1               |
| 10/05/1984 | 78          | 0               |
| 11/08/1984 | 67          | 0               |
| 1/24/85    | 53          | 1               |
| 04/12/1985 | 67          | 0               |
| 4/29/85    | 75          | 0               |
| 6/17/85    | 70          | 0               |
| 7/29/85    | 81          | 0               |
| 8/27/85    | 76          | 0               |
| 10/03/1985 | 79          | 0               |
| 10/30/85   | 75          | 1               |
| 11/26/85   | 76          | 0               |
| 01/12/1986 | 58          | 1               |
Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb
Bayesian approach and JAGS implementation
Model: logistic regression. Input: temperature. Output: failure indicator (i.e. binary) variable.
Approaches:
- Discretize “slope” and “intercept” into \(K\) bins, draw the decision tree (how many leaves do not get killed?)
- Use the kill-renormalize method to compute \({\mathbb{P}}(\text{predicted binary} = 1|\text{data})\)
- Use MCMC/JAGS!
- This creates a set of Monte Carlo samples \(S = \{X_1, X_2, \dots, X_M\}\)
- Here \(X_m = (\text{slope}_m, \text{intercept}_m, \text{predictedBinary}_m)\)
model {
  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)
  for (i in 1:length(temperature)) {
    # incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }
  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require(coda)
require(ggplot2)
data <- read.csv( file = "challenger_data-train.csv", header = TRUE)
# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
'model.bugs',
data = list( # Pass on the data:
'incident' = data$Damage.Incident,
'temperature' = data$Temperature),
inits=inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 23
## Unobserved stochastic nodes: 4
## Total graph size: 111
##
## Initializing model
samples <-
coda.samples(model,
c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
100000) # number of MCMC iterations
summary(samples)
##
## Iterations = 1001:101000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## intercept 19.8246 9.25184 0.0292569 0.803987
## predictedBinary 0.9930 0.08361 0.0002644 0.001392
## predictedPr 0.9925 0.04013 0.0001269 0.001341
## slope -0.3032 0.13592 0.0004298 0.011746
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## intercept 5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary 1.0000 1.0000 1.0000 1.0000 1.00000
## predictedPr 0.9329 0.9989 0.9999 1.0000 1.00000
## slope -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
print(HPDinterval(samples))
## [[1]]
## lower upper
## intercept 4.0054009 38.42484799
## predictedBinary 1.0000000 1.00000000
## predictedPr 0.9723887 1.00000000
## slope -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4] # coda orders columns alphabetically: intercept, predictedBinary, predictedPr, slope
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
## intercept slope
## Min. :-2.853 Min. :-0.98311
## 1st Qu.:13.195 1st Qu.:-0.38156
## Median :18.452 Median :-0.28303
## Mean :19.825 Mean :-0.30323
## 3rd Qu.:25.153 3rd Qu.:-0.20573
## Max. :66.041 Max. : 0.02563
ggplot(df, aes(intercept, slope)) +
geom_density_2d()
Some things to pay particular attention to:
- Credible intervals do not “spill” outside of \([0, 1]\) in contrast to most frequentist-derived confidence intervals
- Compared to last time, added the “predictedPr” variable to the set to be monitored.
- Added bivariate posterior viz via “jail breaking” coda
- Refresher on marginal and joint:
Second new example
In its early days, Google used a method in the spirit of what is described below to tweak their website. Let us say that they want to decide if they should use a white background or a black background (or some other cosmetic feature).
- Each time a user enters the website for the first time, flip a fair (virtual) coin.
- If it comes up heads, show the website with a black background (option A); otherwise, show the one with a white background (option B).
- Record whether the user clicks on at least one search result (as a surrogate for whether they “liked” the website or not).
After a short interval of time, they record the following results:
- number of times A was shown: 51
- number of times A was shown and the user clicked: 45
- number of times B was shown: 47
- number of times B was shown and the user clicked: 37
Think about the following questions:
- What would be a model suitable for a Bayesian analysis of this problem?
- How to make a decision (A or B) based on the MCMC output?
- Optional: implement the idea using JAGS.
require(rjags)
require(coda)
require(ggplot2)
modelstring="
model {
  pA ~ dunif(0, 1)
  pB ~ dunif(0, 1)
  sA ~ dbin(pA, nA)
  sB ~ dbin(pB, nB)
  AisBetterThanB <- ifelse(pA > pB, 1, 0)
}
"
# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
textConnection(modelstring),
data = list( # Pass on the data:
'nA' = 51,
'sA' = 45,
'nB' = 47,
'sB' = 37),
inits=inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 2
## Unobserved stochastic nodes: 2
## Total graph size: 16
##
## Initializing model
samples <-
coda.samples(model,
c('pA', 'pB', 'AisBetterThanB'), # These are the variables we want to monitor (plot, etc)
10000) # number of MCMC iterations
summary(samples)
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## AisBetterThanB 0.8904 0.31241 0.0031241 0.0032970
## pA 0.8673 0.04641 0.0004641 0.0004566
## pB 0.7762 0.05888 0.0005888 0.0005689
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## AisBetterThanB 0.0000 1.0000 1.0000 1.0000 1.0000
## pA 0.7620 0.8394 0.8720 0.9007 0.9439
## pB 0.6529 0.7388 0.7799 0.8179 0.8803
print(HPDinterval(samples))
## [[1]]
## lower upper
## AisBetterThanB 0.0000000 1.0000000
## pA 0.7743609 0.9493196
## pB 0.6578581 0.8831351
## attr(,"Probability")
## [1] 0.95
chain <- samples[[1]]
pA <- chain[,2]
pB <- chain[,3]
df <- data.frame("pA" = as.numeric(pA), "pB" = as.numeric(pB))
summary(df)
## pA pB
## Min. :0.5934 Min. :0.5235
## 1st Qu.:0.8394 1st Qu.:0.7388
## Median :0.8720 Median :0.7799
## Mean :0.8673 Mean :0.7762
## 3rd Qu.:0.9007 3rd Qu.:0.8179
## Max. :0.9788 Max. :0.9465
ggplot(df, aes(pA, pB)) +
xlim(0, 1) +
ylim(0, 1) +
geom_abline(intercept = 0, slope = 1) +
geom_point(size = 0.3, alpha = 0.3)
ggplot(df, aes(pA, pB)) +
xlim(0, 1) +
ylim(0, 1) +
geom_abline(intercept = 0, slope = 1) +
geom_density_2d()
- Fraction of the points in \(S\) that fall in the lower triangle approximates \({\mathbb{P}}(p_A > p_B | \text{data})\)
- This approximates the integral of the posterior density in the region \(p_A > p_B\):
\[\frac{\text{# points in lower triangle}}{M} \approx {\mathbb{P}}(p_A > p_B | \text{data}) = \int \int_{x > y} f_{p_A, p_B|\text{data}}(x, y) {\text{d}}x {\text{d}}y\]
- The probability \({\mathbb{P}}(p_A > p_B | \text{data})\) can be thought of as Bayesian counterpart for “significance”
- In practice, you probably also want to report a measure of effect size.
- If the difference between \(p_A\) and \(p_B\) is very small but you have tons of data, \({\mathbb{P}}(p_A > p_B | \text{data})\) will still be close to one.
- Effect size: roughly, how big is the difference \(\text{diff} = p_A - p_B\)?
- Significance: how reliably have we inferred the sign of the difference?
- How would you approach the problem of assessing effect size in our AB model?
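One natural answer, reusing the Monte Carlo samples already in the data frame df built above (a sketch):
# Posterior of the effect size diff = pA - pB
diff <- df$pA - df$pB
mean(diff)                    # posterior mean effect size
quantile(diff, c(0.05, 0.95)) # central 90% credible interval for the difference
Reporting this interval alongside \({\mathbb{P}}(p_A > p_B | \text{data})\) separates the size of the effect from our confidence in its sign.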
Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb