Bayesian calibration

Thought experiment: what if a certain Bayesian inference method is used many times?

For example, different statisticians, each studying similar but different datasets
They replicate an “experiment”
This leads to a frequentist analysis of Bayesian procedures, which is a very useful thing to do!
The dataset in each experiment is of fixed, finite size (in contrast to the previous topic of consistency, where the dataset size grew larger)

Calibration of credible intervals: a calibrated 90% credible interval should contain the true parameter 90% of the time.

What do we mean by “true parameter”?

Start with a joint distribution, \(\gamma^\star(z, x)\), which we will call “nature”
For each “experiment”: use \(\gamma^\star\) to simulate both a true parameter, \(z^\star\), and an associated dataset \(x^\star\): \((z^\star, x^\star) \sim \gamma^\star\). (This is the same notion as “generating synthetic data” from the graphical model topic.)

What do we mean by “90% credible interval”? A function \(C(x) = [L(x), R(x)]\) which (1) computes the posterior \(\pi(\cdot | x)\), and (2) selects left and right end points, \(L(x)\) and \(R(x)\), such that \(\int_{L(x)}^{R(x)} \pi(z | x) {\text{d}}z = 0.9\). Example: HDI from last week.

What do we mean by “90% of the time”?

Loop over \(1, 2, \dots,\) numberOfExperiments
- generate synthetic data \((z^\star, x^\star) \sim \gamma^\star\)
- compute the credible interval \(C(x^\star)\)
- record if the true parameter is in the interval, \(z^\star \in C(x^\star)\)
Consider the limit, as numberOfExperiments \(\to \infty\), of the fraction of times the true parameter is in the interval
- If the limit is equal to 0.9 we say the credible interval is calibrated (great!)
- If the limit is close to 0.9 we say the credible interval is approximately calibrated (that’s not too bad)
- If the limit is higher than 0.9, we say the credible interval is over-conservative (that’s not too bad)
- If the limit is lower than 0.9, we say the credible interval is anti-conservative (bad!)
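The loop above can be sketched in a few lines of Python. By the law of large numbers, the limiting fraction is the probability, under nature, that \(z^\star \in C(x^\star)\), so a large but finite number of experiments gives a Monte Carlo estimate of it. The model here is an assumption chosen only for illustration (it is not the one used in the exercise): \(z \sim N(0, 1)\) and \(x_i | z \sim N(z, 1)\), for which the posterior is \(N(\sum_i x_i / (n+1), 1/(n+1))\) by conjugacy. The function names are hypothetical, and an equal-tailed interval is used in place of the HDI.

```python
import random
from statistics import NormalDist


def simulate_nature(n, rng):
    """Draw (z_star, x_star) from the joint: z ~ N(0, 1), x_i | z ~ N(z, 1)."""
    z_star = rng.gauss(0.0, 1.0)
    x_star = [rng.gauss(z_star, 1.0) for _ in range(n)]
    return z_star, x_star


def credible_interval(x, level=0.9):
    """Equal-tailed interval from the conjugate posterior N(sum(x)/(n+1), 1/(n+1))."""
    n = len(x)
    posterior = NormalDist(mu=sum(x) / (n + 1), sigma=(1.0 / (n + 1)) ** 0.5)
    tail = (1.0 - level) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)


def estimate_coverage(n, number_of_experiments, seed=1):
    """Fraction of experiments in which the true parameter falls in the interval."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(number_of_experiments):
        z_star, x_star = simulate_nature(n, rng)   # generate synthetic data
        left, right = credible_interval(x_star)    # compute C(x_star)
        if left <= z_star <= right:                # record whether z_star is covered
            hits += 1
    return hits / number_of_experiments


coverage = estimate_coverage(n=5, number_of_experiments=20000)
print(coverage)
```

With a finite number of experiments the printed fraction is only an estimate of the limit, so it should be compared to 0.9 with Monte Carlo error in mind (the standard error is about \(\sqrt{0.9 \times 0.1 / 20000} \approx 0.002\) here).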

Well specified vs misspecified models

There are two joint distributions involved in the thought experiment
- \(\gamma^\star\), used to generate data
- \(\gamma\), used internally by the credible interval procedure to define a posterior \(\pi \propto \gamma\)
We can consider the following setups
- Well-specified setup: \(\gamma^\star = \gamma\)
- Misspecified setup: \(\gamma^\star \neq \gamma\)
For now: focus on the well-specified setup
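To make the distinction between the two joint distributions concrete, here is a minimal sketch in which the interval procedure always uses the same model, while nature is either equal to that model or differs from it. Everything specific here is an assumption for illustration: the model is \(z \sim N(0, 1)\), \(x_i | z \sim N(z, 1)\), and the misspecified nature differs only in its prior mean (3 instead of 0). The function names are hypothetical.

```python
import random
from statistics import NormalDist


def credible_interval(x, level=0.9):
    """Interval computed under the model: z ~ N(0, 1), x_i | z ~ N(z, 1)."""
    n = len(x)
    posterior = NormalDist(mu=sum(x) / (n + 1), sigma=(1.0 / (n + 1)) ** 0.5)
    tail = (1.0 - level) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)


def estimate_coverage(nature_prior_mean, n=1, number_of_experiments=20000, seed=1):
    """Coverage when nature draws z ~ N(nature_prior_mean, 1), x_i | z ~ N(z, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(number_of_experiments):
        z_star = rng.gauss(nature_prior_mean, 1.0)
        x_star = [rng.gauss(z_star, 1.0) for _ in range(n)]
        left, right = credible_interval(x_star)
        if left <= z_star <= right:
            hits += 1
    return hits / number_of_experiments


# Nature equals the model: well-specified setup.
well_specified = estimate_coverage(nature_prior_mean=0.0)
# Nature uses a different prior mean than the model: misspecified setup.
misspecified = estimate_coverage(nature_prior_mean=3.0)
print(well_specified, misspecified)
```

In this particular example the misspecified coverage comes out well below 0.9, i.e. anti-conservative. That is a property of this chosen mismatch and small \(n\), not a general statement about all misspecified models.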

Short exercise

Setup: credible intervals in a simple beta-binomial model
- the beta-binomial model with a beta(1, 1) prior (e.g. Delta rocket, water-land problem)
- to speed up inference, consider using conjugacy instead of Blang (so that you can do the exercise in straightforward R or Python)
  - from exercise set 1, beta(\(\alpha, \beta\)) prior, \(n\) trials, \(k\) successes \(\Longrightarrow\) beta(\(\alpha + k, \beta + (n-k)\)) posterior (this was one of the optional problems)
- for simplicity, instead of HDI, use the inverse CDF (quantile function) of the posterior to chop \(0.05\) probability off each side of the posterior distribution, i.e. take the interval between the \(0.05\) and \(0.95\) posterior quantiles
Use the well-specified setup described earlier
- first, with a large dataset (\(n = 10000\))
- second, with a small dataset (\(n = 2\))
Use simulations to speculate on the calibration of the credible interval in the two setups (\(n = 10000\) and \(n=2\))
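As a possible starting point, the interval construction alone (not the simulation) could look like the sketch below. It uses only the Python standard library: for integer parameters, the beta CDF is evaluated through the identity \(I_x(a, b) = P(\text{Binomial}(a + b - 1, x) \ge a)\) and then inverted by bisection. The function names are hypothetical. If SciPy is available, `scipy.stats.beta.ppf(q, a, b)` computes the same quantile directly and is much faster for large \(n\).

```python
import math


def beta_cdf(x, a, b):
    """CDF of beta(a, b) at x, for positive integers a and b.

    Uses I_x(a, b) = P(Binomial(a + b - 1, x) >= a), summed in log space
    to avoid overflow in the binomial coefficients.
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    m = a + b - 1
    total = 0.0
    for j in range(a, m + 1):
        log_term = (math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
                    + j * math.log(x) + (m - j) * math.log1p(-x))
        total += math.exp(log_term)
    return min(total, 1.0)


def beta_quantile(p, a, b, tol=1e-10):
    """Inverse CDF of beta(a, b) by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equal_tailed_interval(k, n, prior_alpha=1, prior_beta=1, level=0.9):
    """Chop (1 - level) / 2 probability off each side of the beta posterior."""
    a = prior_alpha + k          # conjugate update: alpha + k
    b = prior_beta + (n - k)     # conjugate update: beta + (n - k)
    tail = (1.0 - level) / 2.0
    return beta_quantile(tail, a, b), beta_quantile(1.0 - tail, a, b)


left, right = equal_tailed_interval(k=1, n=2)
print(left, right)
```

Plugging this interval into the thought-experiment loop (simulate \(p\) from the prior, then \(k\) from a binomial, then check whether \(p\) lies in the interval) is left as the exercise.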

Poll: make a guess!

calibrated for small data, calibrated for large data
not calibrated for small data, calibrated for large data
only approximately calibrated for both small and large data
none of the above

Readings

After doing this week’s exercise on calibration, read the following tutorial, especially sections 4 and 6: https://arxiv.org/abs/2011.01808

(optional, but recommended if you are interested in implementing MCMC methods) To dig deeper into the practically critical (for MCMC developers) notion of simulation based calibration, see Validation of Software for Bayesian Models Using Posterior; and original paper: Getting it right