STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/6/2019
Goals
Today:
- Finishing up our analysis of a very simple model
- Calibration, continued
- Bernstein von-Mises
- de Finetti
- Into more interesting models
- PPL/MCMC as a black box
- Bayesian GLMs
- AB testing
“Homework”: software setup! Be ready to use it next week.
Next week: Bayesian model building
Recall: First way to break consistency: violating Cromwell’s rule
- Suppose we put zero prior mass on having a fair die. What happens to the posterior?
If we put prior mass of zero on some event, say \(\pi_0(A) = 0\), then the posterior probability of that event will always be zero, \(\pi_n(A) = 0\), no matter how many observations \(n\) we collect
Cromwell’s rule (Oliver Cromwell, 1650): “I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”
Recall: Second way to break consistency: Unidentifiable models
When there are too many parameters, you may not be able to recover them even if you had “infinite data”
- Example:
- Let us modify the prior for the rocket example (continuous version), setting \(p = p_1 p_2\) where \(p_1 \sim {\text{Unif}}(0,1)\) and \(p_2 \sim {\text{Unif}}(0,1)\).
- Intuition: distinct pairs \((p_1, p_2)\) can yield the same product \(p = p_1 p_2\), hence the same likelihood, e.g. \((p_1, p_2) = (0.1, 0.2)\) and \((p_1, p_2) = (0.05, 0.4)\) both give \(p = 0.02\).
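A quick numerical check of the intuition above (a sketch in R; the launch count is an arbitrary choice):
n_launches <- 10
n_fails <- 0:n_launches
lik1 <- dbinom(n_fails, n_launches, 0.1 * 0.2)  # (p1, p2) = (0.1, 0.2), so p = 0.02
lik2 <- dbinom(n_fails, n_launches, 0.05 * 0.4) # (p1, p2) = (0.05, 0.4), also p = 0.02
all.equal(lik1, lik2) # TRUE: the data cannot distinguish the two pairs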
Model code in the Blang PPL
model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}
- Observations:
- The posterior does not become degenerate as the sample size increases
- It is now asymmetric. This is not a bug! Can you figure out why?
This is an example of what is called an unidentifiable model: there are two distinct parameters, \(\theta_1\) and \(\theta_2\), with \(\theta_1 \neq \theta_2\), yet the likelihoods are exactly equal, \({\mathbb{P}}_{\theta_1} = {\mathbb{P}}_{\theta_2}\).
Note: Paul Gustafson is the expert on Bayesian unidentifiable / weakly identifiable models
Recall: Theoretical justification of consistency: Doob’s consistency theorem
In the exchangeable setup, outside of these two failure modes, we are guaranteed to have consistency. The statement of the theorem is somewhat delicate, and its details matter for interpreting the result correctly.
- Frequentist notions of consistency under \(\theta\):
- setup:
- a parametric model, \(\{{\mathbb{P}}_\theta:\theta \in \Theta\}\), defined on space \(\Omega\)
- observed random variables \(X_i : \Omega \to {\mathbf{R}}^d\)
- estimator \(T_n(x_1, x_2, \dots, x_n)\) for \(\theta\)
- consistency:
- compose the observation r.v. with the estimator: \(T_n = T_n(X_1, X_2, \dots, X_n)\)
- ask that \(T_n \to \theta\) as \(n \to \infty\), under a suitable notion of convergence of random variables…
- either “in \({\mathbb{P}}_\theta\)-probability” or “\({\mathbb{P}}_\theta\) almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all \(\epsilon > 0\), \({\mathbb{P}}_\theta(d(T_n, \theta) < \epsilon) \to 1\) for some metric \(d\)
- almost sure: \({\mathbb{P}}_\theta(T_n \to \theta) = 1\)
- frequentists ask that the above holds for all \(\theta\)
- Bayesian notion of consistency under \(\theta\)
- setup:
- assume the “exchangeable setup”
- let \(\Pi_n(\cdot) = \Pi_n(\cdot | X_1, X_2, \dots, X_n)\) denote the (random) posterior (why random: as for the frequentist case, composition of the observations random variable with the posterior update map)
- ask that \(\Pi_n(A) \to \delta_\theta(A)\) for any \(A\), under a suitable notion of convergence of random variables…
- either “in \({\mathbb{P}}_\theta\)-probability” or “\({\mathbb{P}}_\theta\) almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all \(\epsilon > 0\), \({\mathbb{P}}_\theta(|\Pi_n(A) - \delta_\theta(A)| < \epsilon) \to 1\)
- almost sure: \({\mathbb{P}}_\theta(\Pi_n(A) \to \delta_\theta(A)) = 1\)
- Bayesians ask that the above holds for a set of “good” \(\theta\)’s, denoted \(G\), such that this good set has probability one under the prior: \(\pi_0(G) = 1\).
Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.
Corollary: if you use the wrong prior, \(\tilde \pi_0\), but \(\tilde \pi_0(A) > 0 \Longleftrightarrow \pi_0(A) > 0\), then Doob’s consistency still holds for identifiable exchangeable models.
Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, you will get right anything that’s derived from it)
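To see Doob’s theorem in action, here is a minimal simulation sketch in R (a conjugate Beta-Binomial stand-in for the rocket example; the true parameter and the window width are arbitrary choices):
set.seed(1)
true_p <- 0.3
for (n in c(10, 100, 1000, 10000)) {
  x <- rbinom(1, n, true_p)
  # Uniform prior = Beta(1, 1), so the posterior is Beta(1 + x, 1 + n - x)
  mass_near_truth <- pbeta(true_p + 0.05, 1 + x, 1 + n - x) -
    pbeta(true_p - 0.05, 1 + x, 1 + n - x)
  cat(n, ":", round(mass_near_truth, 3), "\n")
}
The posterior mass within 0.05 of the truth approaches one as \(n\) grows, i.e. the posterior concentrates on the true parameter.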
Recall: calibration of credible intervals
- Recall: notion of prediction calibration, “being right 90% of the time when you say you are 90% sure”
- Today: credible interval calibration
- a calibrated 90% credible interval should contain the true parameter 90% of the time
- Let’s see if this is true using simulated data!
- Let’s check on 1000 small datasets (10 rocket launches each)
- Let’s check on 1000 medium datasets (100 launches each)
- Let’s check on 1000 large datasets (1000 launches each)
- What do you expect?
set.seed(1)
K <- 1000
prior_used_for_computing_posterior <- rep(1/(K+1), K+1) # Use same prior for simulation and posterior computation
n_repeats <- 1000
alpha <- 0.1
hdi_coverage_pr <- function(n_datapoints) {
  n_inclusions <- 0
  for (repetition in 1:n_repeats) {
    # rdunif, posterior_distribution and high_density_intervals are helper
    # functions assumed defined in an earlier lecture
    i <- rdunif(K + 1) - 1 # Always generate the data using a uniform prior
    true_p <- i/K
    x <- rbinom(1, n_datapoints, true_p)
    post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
    # This if is just a hacky way to check if the true parameter is in the HDI credible interval
    if (sum(abs(true_p - high_density_intervals(alpha, post)[1,]) < 10e-10) == 1) {
      n_inclusions <- n_inclusions + 1
    }
  }
  return(n_inclusions/n_repeats) # Fraction of simulations where the true parameter was in the interval
}
df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)
ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
ylim(0, 1) +
geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
geom_line()
- Surprise!: in a Bayesian well-specified context, calibration holds for all dataset sizes!
- Contrast with typical frequentist intervals, which are usually calibrated only for large enough datasets
- This is because frequentist intervals typically need to rely on asymptotics (the Central Limit Theorem) to be computable
- Bonus 1: this is a great way to test correctness of your software implementation.
- Correct Bayesian inference code: calibrated.
- Buggy Bayesian inference code: almost never calibrated.
- This is convenient because you can run the test on small datasets where inference is fast.
- Pick a test just big enough to get code coverage.
- Bonus 2: could you use this to perform model “goodness-of-fit”? How?
- Harness the fact that for a Bayesian, prediction and parameter estimation are done in the same framework
- First, let us pick a simple setup, say iid univariate observations \(y = (y_1, y_2, \dots, y_n)\)
- Let \(y_{-i}\) denote all the observations excluding the \(i\)-th one, \(y_{-i} = (y_1, y_2, \dots, y_{i-1}, y_{i+1}, \dots, y_n)\)
- Idea: if the model is well-specified, \({\mathbb{P}}(Y_{i} \in C_\alpha(Y_{-i})) = \alpha\)
- Approximate the LHS using leave-one-out: \[{\mathbb{P}}(Y_i \in C_\alpha(Y_{-i})) \approx \frac{1}{n} \sum_i I[y_{i} \in C_\alpha(y_{-i})]\] where \(I[\cdot]\) denotes the indicator function of an event, i.e. a function that is 1 if the outcome is in the event and 0 otherwise (a code sketch follows at the end of this section)
- Special case: the cross-validated probability integral transform (PIT); see this paper for practical methods to compute it using importance sampling and other tricks. It uses simple credible intervals of the form \(C_\alpha = (-\infty, x]\) (my hunch: it might be better to use an HDI; this makes no difference if the model is well-specified, but it might in the misspecified case, which matters since we are doing goodness-of-fit!)
- PIT is related to the more commonly used but less formal “posterior predictive check”, where data point \(i\) is not excluded
- easier to compute (only one posterior instead of \(n\) posteriors)
- no theoretical guarantees and hence harder to interpret
- useful as a quick and dirty visualization trick
- Summary:
- People often use “synthetic data” as follows:
- Generate data, hold out the true parameter \(\theta^\star\)
- Report distance between inferred \(\theta\) and \(\theta^\star\)
- But how do you interpret this distance? It can be unclear.
- Message here: there are other ways to use synthetic data in a Bayesian context!
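Here is a minimal sketch in R of the leave-one-out calibration check described above, for a toy conjugate model (Normal observations with known variance 1 and a standard Normal prior on the mean; all the modelling choices are just for illustration):
set.seed(1)
n <- 50
alpha <- 0.1
mu_true <- rnorm(1) # draw the true mean from the prior (well-specified case)
y <- rnorm(n, mu_true, 1)
covered <- 0
for (i in 1:n) {
  y_rest <- y[-i]
  # Conjugate posterior for the mean given y[-i]
  post_var <- 1 / (1 + length(y_rest))
  post_mean <- sum(y_rest) * post_var
  # Posterior predictive for y[i]: Normal(post_mean, post_var + 1)
  interval <- qnorm(c(alpha/2, 1 - alpha/2), post_mean, sqrt(post_var + 1))
  if (y[i] > interval[1] && y[i] < interval[2]) covered <- covered + 1
}
covered / n # should be close to 1 - alpha
Averaging this coverage over many replications should match the nominal level when the model is well-specified.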
Calibration of credible intervals (misspecified case)
Now suppose we are using the “wrong prior”, i.e. data generation uses a uniform prior but we base our posterior computation on the non-uniform prior from earlier
Again, let’s do it for small (10 launches), medium (100), and large datasets (1000), plotting the nominal coverage (dashed) against the actual coverage (solid line)
set.seed(1)
K <- 1000
n_repeats <- 1000
alpha <- 0.1
# Now use the same non-uniform prior as before for the posterior calculation
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K), mean = 0.2, sd = 0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)
ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
ylim(0, 1) +
geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
geom_line()
- Bad news: for small datasets we are no longer calibrated, and in the worst possible way
- Higher than the dashed line would have meant inference is conservative, i.e. more right than it actually claims. That’s not too bad.
- Lower than the dashed line means we are overconfident, or anti-conservative, which in some cases can be reckless
- Good news: this gets corrected quickly as the dataset gets larger. Why?
Asymptotic calibration under certain misspecification
- There is no general calibration theory for misspecified models, only special cases (outside of that, use simulation studies or cross-validation techniques)
- Setup in which we have a theorem: graphical model of the following form
Bernstein-von Mises: under certain conditions, even when the prior is misspecified, the actual coverage of credible intervals converges to the nominal coverage (here “converges” just means “gets arbitrarily close”).
Conditions include, but are not limited to:
- \(\theta \in \Theta\) needs to live in a continuous rather than discrete space! I.e.: \(\Theta \subset {\mathbf{R}}^d\)
- Intuition: Bernstein-von Mises actually proves something stronger: convergence of the rescaled, centered posterior to a normal distribution, which is a continuous distribution (can you guess the scaling?)
- “Bayesian Central Limit Theorem”
- \(p(\theta)\) (the prior) is now a density instead of a PMF
- we assume it is non-zero in a neighborhood of the true parameter
- we assume the posterior mean is consistent
- recall: that means the error goes to zero as number of data points increases
- we assume the model is parametric:
- \(\theta\) is allowed to be a vector, but you are not allowed to make it bigger as the number of observations increases
We will explore these conditions in more detail.
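For reference, one standard way to state the result (a sketch; \(\hat\theta_n\) denotes an efficient estimator such as the MLE, \(I(\theta_0)\) the Fisher information at the true parameter, and the distance is total variation):
\[\left\| \Pi_n - {\mathcal{N}}\left(\hat\theta_n, \frac{1}{n} I(\theta_0)^{-1}\right) \right\|_{\text{TV}} \to 0 \quad \text{in } {\mathbb{P}}_{\theta_0}\text{-probability}\]
Equivalently, the posterior distribution of \(\sqrt{n}(\theta - \hat\theta_n)\) converges to a fixed normal distribution: the \(\sqrt{n}\) rescaling is the answer to the scaling question above.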
Survey: who is familiar with asymptotic normality of the MLE?
Practical consequence:
- for large datasets, and under certain conditions, the prior is “washed away” by the likelihood
- The conditions are not crazy and hold in some situations of interest
- But there are also simple practical situations where the conditions do not hold (e.g. prior assigning zero mass to important regions)
De Finetti: intuition and motivation
- We have invoked the “exchangeable/iid” setup several times already
- De Finetti theorem will give us more insight in this class of models
- Motivation:
- Often, we may not want to rely on the order of the rows in the spreadsheet given to us. Just shuffle it to be safe.
- In such circumstances, de Finetti says that there exists a well-specified model with the graphical model on the right
- I.e. that there exists a prior such that the data is iid given that prior
- This motivation gives you an idea of the theorem, but it is not exactly what de Finetti says; let’s formalize the theorem!
Exchangeable random variables
- Recall: notion of equality in distribution \(X_1 {\stackrel{\scriptscriptstyle d}{=}}X_2\)
- …this means the distribution of \(X_1\) is equal to the distribution of \(X_2\)
- …concretely: draw the marginal PMF or density of each random variable, check if they match
- Example where \(X_1 {\stackrel{\scriptscriptstyle d}{=}}X_2\) but \(X_1 \neq X_2\)?
- Highlight to reveal the answer: You flip a coin and report what is on the top, \(X_1\). Your friend reports what is on the bottom, \(X_2\).
- Two random variables are exchangeable if \((X_1, X_2) {\stackrel{\scriptscriptstyle d}{=}}(X_2, X_1)\)
- Example: uniform distributions (i.e. such that the bivariate density is constant on a 2d shape).
- Develop a criterion for checking from the joint density if it’s exchangeable.
- Highlight to reveal the answer: It should be symmetric with respect to the line \(y = x\)
- Show that the indicator on (a) a square, (b) a circle, and (c) a checkerboard, all centered at zero, leads to exchangeable random variables. Which one is iid?
- Highlight to reveal the answer: Only the square is iid
- Note: exchangeability implies identical distributions
- Note: this notion is closely related to reversibility
- Generalization: \(n\) variables are exchangeable if for all permutations \(\sigma \in S_n\) (bijections \(\sigma : \{1, 2, \dots, n\} \to \{1, 2, \dots, n\}\)), \((X_1, X_2, \dots, X_n) {\stackrel{\scriptscriptstyle d}{=}}(X_{\sigma(1)}, X_{\sigma(2)}, \dots, X_{\sigma(n)})\)
- Infinite exchangeability: an infinite sequence of random variables \((X_1, X_2, \dots)\) is exchangeable if all finite subsets are exchangeable.
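A quick simulation sketch in R illustrating the circle case above (uniform on a disk is exchangeable by symmetry, but not iid since the coordinates constrain each other):
set.seed(1)
pts <- matrix(runif(2 * 100000, -1, 1), ncol = 2) # uniform points on the square
pts <- pts[rowSums(pts^2) <= 1, ]                 # keep the points inside the unit disk
x1 <- pts[, 1]; x2 <- pts[, 2]
mean(abs(x2) > 0.8)                 # unconditional probability
mean(abs(x2[abs(x1) > 0.8]) > 0.8)  # conditional on X1 near the edge: exactly 0 here
If the coordinates were independent, the two numbers would match; instead, \(|X_1| > 0.8\) forces \(|X_2| \le 0.6\) on the disk.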
Continuous random variables: motivation
- Example: \(p \in [0, 1]\) instead of \(p \in \{0, 1/K, 2/K, \dots, 1\}\)
- Motivations to having continuous random variables:
- Discretization becomes exponentially expensive (in the number of dimensions)
- We want Bernstein-von Mises to kick in
- Annoying to select a discretization size (\(K\) in our rocket example)
Continuous random variables: first implementation
- Question: how to store the posterior distribution of a continuous random variable in a computer?
- With a discrete random variable, we could do it with an array of size \(K\)
- For continuous random variables: use a density (a first answer)
Recall: how to use a density to compute a probability?
- Say a random variable \(X\) has density \(f\)
- I want to know the probability that \(X\) is between say 0.1 and 0.3
- Answer:
\[{\mathbb{P}}(X \in [0.1, 0.3]) = \int_{0.1}^{0.3} f(x) {\text{d}}x\]
- At this point, you should have tears in your eyes: integrals are nasty
- No closed form in general
- Even if we pick nice prior with closed form integrals, posterior integrals might be nasty
- Recall in the rocket example, we observed that even if we start with a uniform prior, the posterior is not uniform
- If \(X = (X_1, X_2, \dots, X_d)\) is a random vector (stacked random variables), we need \(d\) iterated integrals
- Numerical approximation (trapezoid, etc) does not scale to models having many dependent random variables
Monte Carlo as a black box
Motivation:
- we want continuous random variables without having to compute integrals (no closed form in general; numerical integration is too slow beyond ~10 dimensions)
- we want combinatorial objects without computing hard sums
However, for simplicity, let us start by assuming the model has only 1 real valued unknown parameter.
Idea:
- Store a list \(S\) of random points called Monte Carlo samples,
- \(S = (X_1, X_2, \dots, X_M)\),
- chosen such that probabilities can be approximated by:
\[{\mathbb{P}}(X \in [0.1, 0.3]) \approx \frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M}\]
Intuition. Let’s say you want to compute the area of this weird curved shape \(A\):
Calculus: need the following double integral over the weird shape \(A\) \[\int \int_A {\text{d}}x {\text{d}}y\]
- Monte Carlo method:
- Simulate random points in the square \(S = (X_1, X_2, \dots, X_M)\)
- Return the fraction of the points that fall inside the weird shape \(A\)
Note: all we need to do for each point \(X_i\) is to check if the point falls in the region or not. This is called pointwise evaluation. No need to compute complicated integrals.
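Here is that recipe in code (a sketch in R; the weird shape is replaced by a unit disk so we can compare against the known area \(\pi\)):
set.seed(1)
M <- 100000
x <- runif(M, -1, 1); y <- runif(M, -1, 1) # random points in the square [-1, 1]^2
inside <- x^2 + y^2 <= 1                   # pointwise evaluation: is the point in the shape?
4 * mean(inside)                           # (fraction inside) * (area of square) is close to pi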
How to create Monte Carlo samples \(S\)
Guiding principle: we have two ways to compute probabilities: densities and Monte Carlo. We want that, as the size of \(S\) increases, the discrepancy between the two methods becomes arbitrarily close to zero
- Mathematically: we want as \(M \to \infty\), \[\frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M} \to \int_{0.1}^{0.3} f(x) {\text{d}}x\] with “overwhelming probability”.
- This property is called a “Law of Large Numbers”
- There are several ways to get a “Law of Large Numbers” (a fact not emphasized enough!)
- Independent and identically distributed random variables \(X_i\) (the case you may be familiar with, but not the only one!)
- The list of states visited by a Markov chain whose long-run behaviour has density \(f\) (called the stationary density)
- But for now you can still treat \(S\) as magically provided by a PPL black box
Computing a mean
\[{\mathbf{E}}[X] = \int x f(x) {\text{d}}x\]
- Monte Carlo way (much easier to implement):
\[{\mathbf{E}}[X] \approx \frac{\sum_{i=1}^M X_i}{M}\]
Again, Laws of Large Numbers link the two
In our weird-shape example, average each coordinate separately over the points that fall in the shape (this estimates the centroid).
For the variance, use \(\text{Var}[X] = {\mathbf{E}}[X^2] - ({\mathbf{E}}[X])^2\); we already have the second term, and for the first term
\[{\mathbf{E}}[X^2] \approx \frac{\sum_{i=1}^M X^2_i}{M}\]
Makes sense: think about it as defining a new random variable \(Y = X^2\) and computing the mean of that transformed random variable…
Generalizing this trick: if I give you a nice enough function \(g\) such as \(g(x) = x^2\),
\[{\mathbf{E}}[g(X)] \approx \frac{\sum_{i=1}^M g(X_i)}{M}\]
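In R, each of these estimates is a one-liner once the samples are available (a sketch; direct Beta draws stand in for the samples a PPL would hand us):
set.seed(1)
S <- rbeta(100000, 2, 5) # stand-in for Monte Carlo samples from some posterior
mean(S)                  # estimates E[X] (exact value here: 2/7)
mean(S^2) - mean(S)^2    # estimates Var[X] via E[X^2] - (E[X])^2
mean(S > 0.5)            # estimates P(X > 0.5), i.e. g(x) = I[x > 0.5]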
More asymptotic regimes
We have done two thought experiments so far:
- Letting the number of observations (rocket launches) go to infinity, \(N \to \infty\)
- Data collection budget
- Often no control on that
- Letting the number of Monte Carlo samples go to infinity, \(M \to \infty\)
- Computational budget
- If not happy with results, easy to increase
- Pop quiz: is the picture on the bottom right always a point mass?
Note:
- for “nice” models such as this one, as \(N \to \infty\) the posterior gets arbitrarily peaky (“converges to a point mass”).
- Conditions will be stated next week.
- But there are less nice models where we do not get convergence to a point mass as the data size goes to infinity.
- For example, recall the behaviour for unidentifiable models
- These will often be harder to sample from in practice
- People usually invoke “multi-modal” problems to motivate advanced MCMC
- Multi-modality does arise and causes problems, but in my experience the more serious practical problems come from unidentifiable or weakly identifiable models
Lecture ended here
Moving towards more interesting models: challenger disaster and GLMs
Context:
- Challenger space shuttle exploded 73 seconds after launch in 1986
- Cause: O-ring used in the solid rocket booster failed due to low temperature at the time of launch (31 F [-1 C]; all temperatures in Fahrenheit from now on)
- Question investigated: was it unsafe to authorize the launch given the temperature in the morning of the launch?
Data:
data <- read.csv("challenger_data-train.csv")
knitr::kable(data, floating.environment="sidewaystable")
| Date       | Temperature | Damage Incident |
|------------|-------------|-----------------|
| 04/12/1981 | 66          | 0               |
| 11/12/1981 | 70          | 1               |
| 3/22/82    | 69          | 0               |
| 6/27/82    | 80          | NA              |
| 01/11/1982 | 68          | 0               |
| 04/04/1983 | 67          | 0               |
| 6/18/83    | 72          | 0               |
| 8/30/83    | 73          | 0               |
| 11/28/83   | 70          | 0               |
| 02/03/1984 | 57          | 1               |
| 04/06/1984 | 63          | 1               |
| 8/30/84    | 70          | 1               |
| 10/05/1984 | 78          | 0               |
| 11/08/1984 | 67          | 0               |
| 1/24/85    | 53          | 1               |
| 04/12/1985 | 67          | 0               |
| 4/29/85    | 75          | 0               |
| 6/17/85    | 70          | 0               |
| 7/29/85    | 81          | 0               |
| 8/27/85    | 76          | 0               |
| 10/03/1985 | 79          | 0               |
| 10/30/85   | 75          | 1               |
| 11/26/85   | 76          | 0               |
| 01/12/1986 | 58          | 1               |
Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb
Bayesian approach and JAGS implementation
Model: logistic regression. Input: temperature. Output: failure indicator (i.e. binary) variable.
Approaches:
- Discretize “slope” and “intercept” into \(K\) bins, draw the decision tree (how many leaves do not get killed?)
- Use the kill-renormalize method to compute \({\mathbb{P}}(\text{predicted binary} = 1|\text{data})\)
- Use MCMC/JAGS!
- This creates a set of Monte Carlo samples \(S = \{X_1, X_2, \dots, X_M\}\)
- Here \(X_m = (\text{slope}_m, \text{intercept}_m, \text{predictedBinary}_m)\)
model {
  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)
  for (i in 1:length(temperature)) {
    # incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }
  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require(coda)
require(ggplot2)
data <- read.csv( file = "challenger_data-train.csv", header = TRUE)
# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
'model.bugs',
data = list( # Pass on the data:
'incident' = data$Damage.Incident,
'temperature' = data$Temperature),
inits=inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 23
## Unobserved stochastic nodes: 4
## Total graph size: 111
##
## Initializing model
samples <-
coda.samples(model,
c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
100000) # number of MCMC iterations
summary(samples)
##
## Iterations = 1001:101000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## intercept 19.8246 9.25184 0.0292569 0.803987
## predictedBinary 0.9930 0.08361 0.0002644 0.001392
## predictedPr 0.9925 0.04013 0.0001269 0.001341
## slope -0.3032 0.13592 0.0004298 0.011746
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## intercept 5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary 1.0000 1.0000 1.0000 1.0000 1.00000
## predictedPr 0.9329 0.9989 0.9999 1.0000 1.00000
## slope -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
print(HPDinterval(samples))
## [[1]]
## lower upper
## intercept 4.0054009 38.42484799
## predictedBinary 1.0000000 1.00000000
## predictedPr 0.9723887 1.00000000
## slope -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4] # coda orders columns alphabetically: intercept, predictedBinary, predictedPr, slope
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
## intercept slope
## Min. :-2.853 Min. :-0.98311
## 1st Qu.:13.195 1st Qu.:-0.38156
## Median :18.452 Median :-0.28303
## Mean :19.825 Mean :-0.30323
## 3rd Qu.:25.153 3rd Qu.:-0.20573
## Max. :66.041 Max. : 0.02563
ggplot(df, aes(intercept, slope)) +
geom_density_2d()
Some things to pay particular attention to:
- Credible intervals do not “spill” outside of \([0, 1]\) in contrast to most frequentist-derived confidence intervals
- Compared to last time, added the “predictedPr” variable to the set to be monitored.
- Added bivariate posterior viz via “jail breaking” coda
- Refresher on marginal and joint:
Second new example
In its early days, Google used a method in the spirit of what is described below to tweak their website. Let us say that they want to decide if they should use a white background or a black background (or some other cosmetic feature).
- Each time a user enters the website for the first time, flip a fair (virtual) coin.
- If it comes up heads, show the website with a black background (option A); otherwise, show the one with a white background (option B).
- Record whether the user clicks on at least one search result (as a surrogate for whether they “liked” the website or not).
After a short interval of time, they record the following results:
- number of times A was shown: 51
- number of times A was shown and the user clicked: 45
- number of times B was shown: 47
- number of times B was shown and the user clicked: 37
Think about the following questions:
- What would be a model suitable for a Bayesian analysis of this problem?
- How to make a decision (A or B) based on the MCMC output?
- Optional: implement the idea using JAGS.
require(rjags)
require(coda)
require(ggplot2)
modelstring="
model {
  pA ~ dunif(0, 1)
  pB ~ dunif(0, 1)
  sA ~ dbin(pA, nA)
  sB ~ dbin(pB, nB)
  AisBetterThanB <- ifelse(pA > pB, 1, 0)
}
"
# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is
# deterministic given that initial seed.
my.seed <- 1
set.seed(my.seed)
inits <- list(.RNG.name = "base::Mersenne-Twister",
.RNG.seed = my.seed)
model <- jags.model(
textConnection(modelstring),
data = list( # Pass on the data:
'nA' = 51,
'sA' = 45,
'nB' = 47,
'sB' = 37),
inits=inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 2
## Unobserved stochastic nodes: 2
## Total graph size: 16
##
## Initializing model
samples <-
coda.samples(model,
c('pA', 'pB', 'AisBetterThanB'), # These are the variables we want to monitor (plot, etc)
10000) # number of MCMC iterations
summary(samples)
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## AisBetterThanB 0.8904 0.31241 0.0031241 0.0032970
## pA 0.8673 0.04641 0.0004641 0.0004566
## pB 0.7762 0.05888 0.0005888 0.0005689
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## AisBetterThanB 0.0000 1.0000 1.0000 1.0000 1.0000
## pA 0.7620 0.8394 0.8720 0.9007 0.9439
## pB 0.6529 0.7388 0.7799 0.8179 0.8803
print(HPDinterval(samples))
## [[1]]
## lower upper
## AisBetterThanB 0.0000000 1.0000000
## pA 0.7743609 0.9493196
## pB 0.6578581 0.8831351
## attr(,"Probability")
## [1] 0.95
chain <- samples[[1]]
pA <- chain[,2]
pB <- chain[,3]
df <- data.frame("pA" = as.numeric(pA), "pB" = as.numeric(pB))
summary(df)
## pA pB
## Min. :0.5934 Min. :0.5235
## 1st Qu.:0.8394 1st Qu.:0.7388
## Median :0.8720 Median :0.7799
## Mean :0.8673 Mean :0.7762
## 3rd Qu.:0.9007 3rd Qu.:0.8179
## Max. :0.9788 Max. :0.9465
ggplot(df, aes(pA, pB)) +
xlim(0, 1) +
ylim(0, 1) +
geom_abline(intercept = 0, slope = 1) +
geom_point(size = 0.3, alpha = 0.3)
ggplot(df, aes(pA, pB)) +
xlim(0, 1) +
ylim(0, 1) +
geom_abline(intercept = 0, slope = 1) +
geom_density_2d()
- Fraction of the points in \(S\) that fall in the lower triangle approximates \({\mathbb{P}}(p_A > p_B | \text{data})\)
- This approximates the integral of the posterior density in the region \(p_A > p_B\):
\[\frac{\text{# points in lower triangle}}{M} \approx {\mathbb{P}}(p_A > p_B | \text{data}) = \int \int_{x > y} f_{p_A, p_B|\text{data}}(x, y) {\text{d}}x {\text{d}}y\]
- The probability \({\mathbb{P}}(p_A > p_B | \text{data})\) can be thought of as Bayesian counterpart for “significance”
- In practice, you probably also want to report a measure of effect size.
- If the difference between \(p_A\) and \(p_B\) is very small but you have tons of data, \({\mathbb{P}}(p_A > p_B | \text{data})\) will still be close to one.
- Effect size: roughly, how big is the difference \(\text{diff} = p_A - p_B\)?
- Significance: how reliably have we inferred the sign of the difference?
- How would you approach the problem of assessing effect size in our AB model?
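One natural answer, reusing the Monte Carlo samples already in the data frame df built above (a sketch):
# Posterior of the effect size diff = pA - pB
diff <- df$pA - df$pB
mean(diff)                    # posterior mean effect size
quantile(diff, c(0.05, 0.95)) # central 90% credible interval for the difference
Reporting this interval alongside \({\mathbb{P}}(p_A > p_B | \text{data})\) separates the size of the effect from our confidence in its sign.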
Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb