Calibration, continued; Bernstein-von Mises

Alexandre Bouchard-Côté

Bayesian calibration, continued

Exercise debrief

Why well-specified credible intervals are calibrated for any fixed dataset sizes

drawing

Application 1: software testing

Application 2: goodness-of-fit

Poll: how to use calibration to do goodness-of-fit

  1. Use synthetic data, loop over \(i\): check if \(\frac{1}{100} \sum_d {{\bf 1}}[y^{(d)}_1 \in C(y^{(d)}_1, \dots, y^{(d)}_{20})] \approx 0.9\)
  2. Use synthetic data, loop over \(i\): check if \(\frac{1}{100} \sum_d {{\bf 1}}[y^{(d)}_1 \in C(y^{(d)}_2, \dots, y^{(d)}_{20})] \approx 0.9\)
  3. Use real data, loop over \(i\): check if \(\frac{1}{100} \sum_d {{\bf 1}}[y^{(d)}_1 \in C(y^{(d)}_1, \dots, y^{(d)}_{20})] \approx 0.9\)
  4. Use real data, loop over \(i\): check if \(\frac{1}{100} \sum_d {{\bf 1}}[y^{(d)}_1 \in C(y^{(d)}_2, \dots, y^{(d)}_{20})] \approx 0.9\)
  5. None of the above

Recall: \({{\bf 1}}[\cdots]\) denotes the indicator on the event \([\cdot]\), i.e. a function that is 1 if the input outcome \(s\) is in the event, 0 otherwise.

Poll: if the model is well-specified, what do you expect to see

  1. \(f(\alpha)\) should be constant
  2. \(f(\alpha) = \alpha\)
  3. \(f(\alpha) = a \alpha\) for some \(a\)
  4. \(f(\alpha) = a \alpha + b\) for some \(a, b\)
  5. Something else

Misspecified models and calibration

What to do if your model is misspecified?

One obvious possibility is to make the model better. But in the following we show that under certain conditions, another possibility is to collect more data.

 # Using now the same non-uniform prior as before for posterior calculation
K <- 1000

rdunif <- function(max) { return(ceiling(max*runif(1))) }

posterior_distribution <- function(prior_probs, n_successes, n_trials){         
  K <- length(prior_probs)-1 # K+1 values that your p can assume
  n_fails <- n_trials - n_successes
  p <- seq(0, 1, 1/K)
  posterior_probs <-                                       # 1. this computes gamma(i)
    prior_probs *                                          #    - prior
    p^n_successes * (1-p)^n_fails                          #    - likelihood 
  posterior_probs <- posterior_probs/sum(posterior_probs)  # 2. normalize gamma(i)
  post_prob <- rbind(p, posterior_probs)
  return(post_prob)
}

high_density_intervals <- function(alpha, posterior_probs){
  ordered_probs = posterior_probs[,order(posterior_probs[2,], decreasing = TRUE)]
  cumulative_probs = cumsum(ordered_probs[2,])
  index = which.max(cumulative_probs >= (1-alpha))
  return(ordered_probs[,1:index, drop=FALSE])
}

hdi_coverage_pr <- function(n_datapoints) {
  n_inclusions <- 0
  for (repetition in seq(1:n_repeats)) {
    i <- rdunif(K + 1) - 1  # Always generate the data using a uniform prior
    true_p <- i/K
    x <- rbinom(1, n_datapoints, true_p)
    post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
    
    # This if is just a hacky way to check if true parameter is in the HDI credible interval
    if (sum(abs(true_p - high_density_intervals(alpha, post)[1,]) < 10e-10) == 1) {
      n_inclusions <- n_inclusions + 1
    }
  }
  return(n_inclusions/n_repeats) # Fraction of simulation where the true parameter was in interval
}

prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)

set.seed(1)

n_repeats <- 1000
alpha <- 0.1

df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)

ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

Asymptotic calibration under certain misspecification

drawing

Bernstein-von Mises: under certain conditions, even when the prior is misspecified the actual coverage of credible intervals converges to the nominal coverage. (converge just means “gets arbitrarily close”)

Conditions include, but are not limited to:

  1. \(\theta \in \Theta\) needs to live in a continuous rather than discrete space! I.e.: \(\Theta \subset {\mathbf{R}}^d\)
    • Intuition: Bernstein-von Mises actually proves something stronger: convergence of the rescaled, centered posterior to a normal distribution, which is a continuous distribution (can you guess the scaling?)
    • “Bayesian Central Limit Theorem”
  2. \(p(\theta)\) (the prior) is now a density instead of a PMF,
    • we assume it is non-zero in a neighborhood of the true parameter
  3. we assume the posterior mean is consistent
    • recall: that means the error goes to zero as number of data points increases
  4. we assume the model is parametric:
    • \(\theta\) is allowed to be a vector, but you are not allowed to make it bigger as the number of observations increases

Practical consequence: