# Bayesian calibration

Thought experiment: what if a certain Bayesian inference method is used many times?

• For example, different statisticians, each studying similar but different datasets
• They replicate an “experiment”
• Leads to frequentist analysis of Bayesian procedures - a very useful thing to do!
• The dataset in each experiment has a fixed, finite size (in contrast to the previous topic of consistency, where the dataset size grew larger)

Calibration of credible intervals: a calibrated 90% credible interval should contain the true parameter 90% of the time.

What do we mean by “true parameter”?

• Start with a joint distribution, $$\gamma^\star(z, x)$$, which we will call “nature”
• For each “experiment”: use $$\gamma^\star$$ to simulate both a true parameter, $$z^\star$$, and an associated dataset, $$x^\star$$: $$(z^\star, x^\star) \sim \gamma^\star$$. (This is the notion of “generating synthetic data” from the graphical models topic.)

What do we mean by “90% credible interval”? A function $$C(x) = [L(x), R(x)]$$ which (1) computes the posterior $$\pi(\cdot | x)$$, and (2) selects left and right end points, $$l = L(x)$$ and $$r = R(x)$$, such that $$\int_l^r \pi(z | x) {\text{d}}z = 0.9$$. Example: the HDI from last week.
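As a concrete sketch of step (2), an equal-tailed variant of the end-point selection (chop 0.05 of probability into each tail, as the exercise below suggests) can be read off the posterior's inverse CDF. Here we assume a Beta posterior via scipy; the parameter values (3, 5) are hypothetical, just for illustration:

```python
from scipy import stats

def equal_tailed_interval(posterior, level=0.9):
    """Select end points l, r with posterior mass `level` between them,
    chopping (1 - level) / 2 of probability into each tail."""
    tail = (1 - level) / 2
    return posterior.ppf(tail), posterior.ppf(1 - tail)

# Hypothetical example: a Beta(3, 5) posterior
post = stats.beta(3, 5)
l, r = equal_tailed_interval(post)
```

Note this equal-tailed interval generally differs from the HDI; both contain 0.9 posterior mass, but the HDI is the shortest such interval.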

What do we mean by “90% of the time”?

• Loop over $$1, 2, \dots,$$ numberOfExperiments
• generate synthetic data $$(z^\star, x^\star) \sim \gamma^\star$$
• compute the credible interval $$C(x^\star)$$
• record whether the true parameter is in the interval, $$z^\star \in C(x^\star)$$
• Consider the limit, as numberOfExperiments $$\to \infty$$, of the fraction of times the true parameter is in the interval
• If the limit is equal to 0.9 we say the credible interval is calibrated (great!)
• If the limit is close to 0.9 we say the credible interval is approximately calibrated (that’s not too bad)
• If the limit is higher than 0.9, we say the credible interval is over-conservative (that’s not too bad)
• If the limit is lower than 0.9, we say the credible interval is anti-conservative (bad!)
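The loop above can be sketched in Python. Here `generate` and `interval` are placeholder names introduced for illustration: they stand in for the data-generating step $$(z^\star, x^\star) \sim \gamma^\star$$ and the credible-interval procedure $$C$$:

```python
import random
random.seed(1)

def calibration_fraction(generate, interval, n_experiments=10_000):
    """Estimate the limiting fraction of experiments in which the
    true parameter lands in the credible interval."""
    hits = 0
    for _ in range(n_experiments):
        z_star, x_star = generate()   # (z*, x*) ~ gamma*
        l, r = interval(x_star)       # credible interval C(x*)
        hits += (l <= z_star <= r)    # is z* in C(x*)?
    return hits / n_experiments

# Toy sanity check (not a real posterior): z* uniform on [0, 1] and a
# fixed "interval" [0, 0.9], so the fraction should approach 0.9.
frac = calibration_fraction(lambda: (random.random(), None),
                            lambda x: (0.0, 0.9))
```

With a finite number of experiments the fraction only approximates the limit, so in practice one compares the estimate to 0.9 up to Monte Carlo error.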

# Well specified vs misspecified models

• There are two joint distributions involved in the thought experiment
• $$\gamma^\star$$, used to generate data
• $$\gamma$$, used internally by the credible interval procedure to define a posterior $$\pi \propto \gamma$$
• We can consider the following setups
• Well-specified setup: $$\gamma^\star = \gamma$$
• Misspecified setup: $$\gamma^\star \neq \gamma$$
• For now: focus on the well-specified setup

# Short exercise

• Setup: credible intervals in a simple beta binomial model
• the beta-binomial model with a beta(1, 1) prior (e.g. Delta rocket, water-land problem)
• to speed up inference, consider using conjugacy instead of blang (so that you can do the exercise in straightforward R or python)
• from exercise set 1, beta($$\alpha, \beta$$) prior, $$n$$ trials, $$k$$ successes $$\Longrightarrow$$ beta($$\alpha + k, \beta + (n-k)$$) posterior (this was one of the optional problems)
• for simplicity, instead of HDI, use the inverse CDF function of the posterior to chop $$0.05$$ probability on each side of the posterior distribution
• Use the well-specified setup described earlier
• first, with a large dataset ($$n = 10000$$)
• second, with a small dataset ($$n = 2$$)
• Use simulations to speculate on the calibration of the credible interval in the two setups ($$n = 10000$$ and $$n=2$$)
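One possible way to run this simulation in Python (assuming numpy and scipy are available; the helper name `coverage` is introduced here, and the result is left for you to compare against 0.9 for the poll):

```python
import numpy as np
from scipy import stats

def coverage(n, n_experiments=1_000, seed=1):
    """Well-specified setup: nature and the inference procedure share
    the same Beta(1, 1)-binomial joint distribution."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_experiments):
        p_star = rng.beta(1, 1)                # z* ~ prior
        k = rng.binomial(n, p_star)            # x* | z*: k successes in n trials
        post = stats.beta(1 + k, 1 + n - k)    # conjugate posterior
        l, r = post.ppf(0.05), post.ppf(0.95)  # chop 0.05 on each side
        hits += (l <= p_star <= r)             # z* in C(x*)?
    return hits / n_experiments

# Compare coverage(n=10_000) and coverage(n=2) to the nominal 0.9.
```

Remember that the estimated fraction carries Monte Carlo error of order $$\sqrt{0.9 \cdot 0.1 / \text{numberOfExperiments}}$$, so judge calibration up to that noise.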

### Poll: make a guess!

1. calibrated for small data, calibrated for large data
2. not calibrated for small data, calibrated for large data
3. only approximately calibrated for both small and large data
4. none of the above

# Readings

After doing this week’s exercise on calibration, read the following tutorial, especially sections 4 and 6: https://arxiv.org/abs/2011.01808