# Bayesian calibration

Thought experiment: what if a certain Bayesian inference method is used many times?

• For example, different statisticians, each studying similar but different datasets
• They replicate an “experiment”
• Leads to frequentist analysis of Bayesian procedures - a very useful thing to do!
• The dataset in each experiment has a fixed, finite size (in contrast to the previous topic of consistency, where the dataset size grew larger)

Calibration of credible intervals: a calibrated 90% credible interval should contain the true parameter 90% of the time.

What do we mean by “true parameter”?

• Start with a joint distribution, $$\gamma^\star(z, x)$$, which we will call “nature”
• For each “experiment”: use $$\gamma^\star$$ to simulate both a true parameter, $$z^\star$$, and an associated dataset, $$x^\star$$: $$(z^\star, x^\star) \sim \gamma^\star$$. (This is the notion of “generating synthetic data” from the graphical models topic.)

What do we mean by “90% credible interval”? A function $$C(x) = [L(x), R(x)]$$ which (1) computes the posterior $$\pi(\cdot | x)$$, and (2) selects left and right end points, $$l = L(x)$$ and $$r = R(x)$$, such that $$\int_l^r \pi(z | x) {\text{d}}z = 0.9$$. Example: the HDI from last week.
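As a concrete sketch of step (2), an equal-tailed variant of the end-point selection (chop 0.05 of probability into each tail, as the exercise below suggests) can be read off the posterior's inverse CDF. Here we assume a Beta posterior via scipy; the parameter values (3, 5) are hypothetical, just for illustration:

```python
from scipy import stats

def equal_tailed_interval(posterior, level=0.9):
    """Select end points l, r with posterior mass `level` between them,
    chopping (1 - level) / 2 of probability into each tail."""
    tail = (1 - level) / 2
    return posterior.ppf(tail), posterior.ppf(1 - tail)

# Hypothetical example: a Beta(3, 5) posterior
post = stats.beta(3, 5)
l, r = equal_tailed_interval(post)
```

Note this equal-tailed interval generally differs from the HDI; both contain 0.9 posterior mass, but the HDI is the shortest such interval.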

What do we mean by “90% of the time”?

• Loop over $$1, 2, \dots,$$ numberOfExperiments
• generate synthetic data $$(z^\star, x^\star) \sim \gamma^\star$$
• compute the credible interval $$C(x^\star)$$
• record whether the true parameter is in the interval, $$z^\star \in C(x^\star)$$
• Consider the limit, as numberOfExperiments $$\to \infty$$, of the fraction of times the true parameter is in the interval
• If the limit is equal to 0.9 we say the credible interval is calibrated (great!)
• If the limit is close to 0.9 we say the credible interval is approximately calibrated (that’s not too bad)
• If the limit is higher than 0.9, we say the credible interval is over-conservative (that’s not too bad)
• If the limit is lower than 0.9, we say the credible interval is anti-conservative (bad!)
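The loop above can be sketched in Python. Here `generate` and `interval` are placeholder names introduced for illustration: they stand in for the data-generating step $$(z^\star, x^\star) \sim \gamma^\star$$ and the credible-interval procedure $$C$$:

```python
import random
random.seed(1)

def calibration_fraction(generate, interval, n_experiments=10_000):
    """Estimate the limiting fraction of experiments in which the
    true parameter lands in the credible interval."""
    hits = 0
    for _ in range(n_experiments):
        z_star, x_star = generate()   # (z*, x*) ~ gamma*
        l, r = interval(x_star)       # credible interval C(x*)
        hits += (l <= z_star <= r)    # is z* in C(x*)?
    return hits / n_experiments

# Toy sanity check (not a real posterior): z* uniform on [0, 1] and a
# fixed "interval" [0, 0.9], so the fraction should approach 0.9.
frac = calibration_fraction(lambda: (random.random(), None),
                            lambda x: (0.0, 0.9))
```

With a finite number of experiments the fraction only approximates the limit, so in practice one compares the estimate to 0.9 up to Monte Carlo error.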

# Well specified vs misspecified models

• There are two joint distributions involved in the thought experiment
• $$\gamma^\star$$, used to generate data
• $$\gamma$$, used internally by the credible interval procedure to define a posterior $$\pi \propto \gamma$$
• We can consider the following setups
• Well-specified setup: $$\gamma^\star = \gamma$$
• Misspecified setup: $$\gamma^\star \neq \gamma$$
• For now: focus on the well-specified setup

# Short exercise

• Setup: credible intervals in a simple beta binomial model
• the beta-binomial model with a beta(1, 1) prior (e.g. Delta rocket, water-land problem)
• to speed up inference, consider using conjugacy instead of blang (so that you can do the exercise in straightforward R or python)
• from exercise set 1, beta($$\alpha, \beta$$) prior, $$n$$ trials, $$k$$ successes $$\Longrightarrow$$ beta($$\alpha + k, \beta + (n-k)$$) posterior (this was one of the optional problems)
• for simplicity, instead of HDI, use the inverse CDF function of the posterior to chop $$0.05$$ probability on each side of the posterior distribution
• Use the well-specified setup described earlier
• first, with a large dataset ($$n = 10000$$)
• second, with a small dataset ($$n = 2$$)
• Use simulations to speculate on the calibration of the credible interval in the two setups ($$n = 10000$$ and $$n=2$$)
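One possible way to run this simulation in Python (assuming numpy and scipy are available; the helper name `coverage` is introduced here, and the result is left for you to compare against 0.9 for the poll):

```python
import numpy as np
from scipy import stats

def coverage(n, n_experiments=1_000, seed=1):
    """Well-specified setup: nature and the inference procedure share
    the same Beta(1, 1)-binomial joint distribution."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_experiments):
        p_star = rng.beta(1, 1)                # z* ~ prior
        k = rng.binomial(n, p_star)            # x* | z*: k successes in n trials
        post = stats.beta(1 + k, 1 + n - k)    # conjugate posterior
        l, r = post.ppf(0.05), post.ppf(0.95)  # chop 0.05 on each side
        hits += (l <= p_star <= r)             # z* in C(x*)?
    return hits / n_experiments

# Compare coverage(n=10_000) and coverage(n=2) to the nominal 0.9.
```

Remember that the estimated fraction carries Monte Carlo error of order $$\sqrt{0.9 \cdot 0.1 / \text{numberOfExperiments}}$$, so judge calibration up to that noise.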

### Poll: make a guess!

1. calibrated for small data, calibrated for large data
2. not calibrated for small data, calibrated for large data
3. only approximately calibrated for both small and large data
4. none of the above

# Readings

After doing this week’s exercise on calibration, read the following tutorial, especially sections 4 and 6: https://arxiv.org/abs/2011.01808