STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/4/2019
Goal today
- Consistency, continued
- Exercise debrief
- Calibration, continued
- Bernstein-von Mises
Recall: motivation
Typical statistics course: a bunch of recipes.
- Here is a method for clustering. Here is one for classification. Here is one for recommender systems. Etc.
- Every problem is a new problem.
- If it’s not in the textbook, the frequentist might need a PhD student to spend 5 years on the problem.
Bayesian statistics: a philosophy applicable to pretty much any data analysis problem
- I am not saying you should always use it, but you almost always can, and it is often your only option if you want to use all the information at hand within a realistic modelling time frame
Recall: some imperfect but useful terminology
- One graphical model that occurs a lot is this one (sometimes called an “exchangeable” or “iid” setup, for reasons to be discussed later)

- In this setup, a convention is:
- to call the distribution on θ the “prior”
- to call the conditional distribution of the observations given the parameter θ the “likelihood”
- However keep in mind the Bayesian paradigm applies to any graphical models, not just this one.
- In some other graphical models, it may be awkward to decide what is the “prior” and what is the “likelihood”.
- But to perform Bayesian inference, all we need is a joint distribution at the end of the day.
- However, many theoretical results do assume the exchangeable setup
Recall: Accuracy of point estimate as a function of number of observations?
- You are asked for one more significant digit in your point estimate,
- how many additional observations should you collect?
- Let’s use some code based on synthetic datasets (similar idea to calibration experiment from last time)
set.seed(1)
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max
K = 1000
# Later, we will use different prior to generate data versus do inference,
# but for now, use the same (uniform on the K+1 grid points)
prior_used_for_computing_posterior <- rep(1/(K+1), K+1)
simulate_posterior_mean_error <- function(n_datapoints) {
  i <- rdunif(K + 1) - 1 # Always generate the data using a uniform prior
  true_p <- i/K
  x <- rbinom(1, n_datapoints, true_p) # Recall: a Binomial is just a sum of iid Bernoullis
  post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
  post_mean <- posterior_mean(post)
  return(abs(post_mean - true_p)) # Return the error by comparing posterior mean and p used to generate data
}
Recall: Accuracy of point estimate, empirical results
- Calling our new function to get errors for:
- 1000 datasets each of size 10 (= 10^1)
- 1000 datasets each of size 32 (= 10^(3/2))
- …
set.seed(1)
library(ggplot2)
df <- data.frame("n_observations" = rep(c(10, 32, 100, 316, 1000), each=1000))
df$error_well_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
ggplot(data=df, aes(x=n_observations, y=error_well_specified+1e-5)) +
  stat_summary(fun.y = mean, geom="line") + # Line averages over 1000 datasets
  scale_x_log10() + # Show result in log-log scale
  scale_y_log10() +
  geom_point()

Recall: Accuracy of point estimate as a function of number of observations
Fitting a line on this log-log plot, $\underbrace{\log_{10}(\text{error})}_{y} = a\,\underbrace{\log_{10}(\text{number of observations})}_{x} + b$, the slope comes out at $a \approx -1/2$,
so we get $\text{error} = O\big((\text{number of observations})^{-1/2}\big)$
Similarly to the frequentist setting, for many models the root mean squared error scales like $\frac{\text{constant}}{\sqrt{\text{number of observations}}}$
- So if we want to reduce it by factor 10 (i.e. “gain one significant digit”), need 100x more observations
- To gain 2 more significant digits: need 10,000x more observations, etc
What we did here is an example of an (empirical) asymptotic analysis
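To make this concrete, here is a small sketch (reusing the df computed above; the least-squares fit is mine, not part of the original code) estimating the slope of the log-log relationship; it should come out close to -1/2:
# Average the errors for each dataset size, then fit a line in log-log space
means <- aggregate(error_well_specified ~ n_observations, data = df, FUN = mean)
fit <- lm(log10(error_well_specified) ~ log10(n_observations), data = means)
coef(fit) # the log10(n_observations) coefficient is roughly -0.5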
Recall: What if we used the “wrong” prior
- Recall the non-uniform prior from the exercise question:
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
plot(seq(0, 1, by=1/K), prior_used_for_computing_posterior)

- The code below will add another curve to the error plot
- In this new curve, the posterior is computed based on the wrong (non-uniform) prior
- But we still simulate the “true” p uniformly, hence the misspecification
set.seed(2)
library(tidyr) # for gather()
df$error_mis_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
df <- gather(data=df, key=prior_type, value=error, error_mis_specified, error_well_specified)
ggplot(data=df, aes(x=n_observations, y=error+1e-5, colour=factor(prior_type))) +
  stat_summary(fun.y = mean, geom="line") +
  scale_x_log10() +
  scale_y_log10() +
  geom_point()

- The misspecified model exhibits slightly lower performance when there are 10 observations,
- but the performance of the two models becomes very close as the number of observations increases
- Do you think this will always be like this for any misspecified models?
Recall: First way to break consistency: violating Cromwell’s rule
- Suppose we put zero prior mass on having a fair die. What happens to the posterior?

If we put prior mass of zero on some event, say $\pi_0(A)=0$, then the posterior probability of that event will always be zero, $\pi_n(A)=0$, no matter how many observations $n$ we get
Cromwell’s rule (Oliver Cromwell, 1650): “I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”
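A minimal sketch of this failure mode, reusing the discretized rocket model and the posterior_distribution() helper defined in the exercise debrief below (the prior constructed here is purely illustrative):
K <- 1000
prior_zero_at_half <- rep(1/K, K + 1)
prior_zero_at_half[K/2 + 1] <- 0 # zero prior mass on the grid point p = 0.5
prior_zero_at_half <- prior_zero_at_half / sum(prior_zero_at_half)
x <- rbinom(1, 1000, 0.5) # but the data really comes from p = 0.5
post <- posterior_distribution(prior_zero_at_half, x, 1000)
post[2, abs(post[1, ] - 0.5) < 1e-10] # posterior probability of p = 0.5: still exactly 0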
Second way to break consistency: Unidentifiable models
When there are too many parameters, you may not be able to recover them even if you had “infinite data”
- Example:
- Let us modify the prior for the rocket example (continuous version), setting $p = p_1 p_2$ where $p_1 \sim \text{Unif}(0,1)$ and $p_2 \sim \text{Unif}(0,1)$.
- Intuition: there are distinct pairs $(p_1, p_2)$ that yield the same product $p = p_1 p_2$, and hence the same likelihood, e.g. $(p_1,p_2)=(0.1,0.2)$ and $(p_1,p_2)=(0.05,0.4)$.
Model code in the Blang PPL
model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}

- Observations:
- The posterior does not become degenerate as the sample size increases
- It is now asymmetric. This is not a bug! Can you figure out why?
This is an example of what is called an unidentifiable model: there are two distinct parameters $\theta_1 \neq \theta_2$ whose likelihoods are exactly the same, $P_{\theta_1} = P_{\theta_2}$.
Note: Paul Gustafson is the expert on Bayesian unidentifiable / weakly identifiable models
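The Blang output is not reproduced here, but a rough grid-based sketch of the same phenomenon (grid size, true values, and variable names are mine) shows the posterior spreading along the curve p1 * p2 = constant rather than concentrating on a point:
grid <- seq(0.01, 0.99, by = 0.01)
n_launches <- 10000
n_fails <- rbinom(1, n_launches, 0.3 * 0.5) # true p1 = 0.3, p2 = 0.5
log_lik <- outer(grid, grid, function(p1, p2) dbinom(n_fails, n_launches, p1 * p2, log = TRUE))
post <- exp(log_lik - max(log_lik)) # uniform prior, so posterior is proportional to the likelihood
post <- post / sum(post)
contour(grid, grid, post, xlab = "p1", ylab = "p2") # mass spread along a curve, even with 10000 observations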
Theoretical justification of consistency: Doob’s consistency theorem
In the exchangeable setup, outside of these two failure modes, we are guaranteed to have consistency. The details of the statement are somewhat delicate and need to be spelled out to interpret the result correctly.
- Frequentist notions of consistency under θ:
- setup:
- a parametric model, {Pθ:θ∈Θ}, defined on space Ω
- observed random variables Xi:Ω→Rd
- estimator Tn(x1,x2,…,xn) for θ
- consistency:
- compose the observation r.v. with the estimator: Tn=Tn(X1,X2,…,Xn)
- ask that Tn→θ as n→∞, under a suitable notion of convergence of random variables…
- either “in Pθ-probability” or “Pθ almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all $\epsilon>0$, $P_\theta(d(T_n,\theta)<\epsilon)\to 1$, for some metric $d$
- almost sure: Pθ(Tn→θ)=1
- frequentists ask that the above holds for all θ
- Bayesian notion of consistency under θ
- setup:
- assume the “exchangeable setup”

- let Πn(⋅)=Πn(⋅|X1,X2,…,Xn) denote the (random) posterior (why random: as for the frequentist case, composition of the observations random variable with the posterior update map)
- ask that Πn(A)→δθ(A) for any A, under a suitable notion of convergence of random variables…
- either “in Pθ-probability” or “Pθ almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all $\epsilon>0$, $P_\theta(|\Pi_n(A)-\delta_\theta(A)|<\epsilon)\to 1$
- almost sure: Pθ(Πn(A)→δθ(A))=1
- Bayesians ask that the above holds for a set of “good” θ’s, denoted G, such that this good set has probability one under the prior π0(G)=1.
Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.
Corollary: if you use the wrong prior, $\tilde\pi_0$, but $\tilde\pi_0(A)>0 \iff \pi_0(A)>0$, then Doob’s consistency still holds for identifiable exchangeable models.
Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, you will get right anything that’s derived from it)
Exercise debrief: Bayes rule

- Strategy (Bayes’ rule):
- Compute joint probabilities γ(i)=P(Ci,H1,H2,H3)
- Use chain rule (multiply edge conditional probabilities)
- Normalize: compute Z=γ(0)+γ(1)+γ(2) and divide the vector γ by Z
$\pi(i) = \frac{\gamma(i)}{\sum_{j=0}^{K}\gamma(j)}$
posterior_distribution <- function(prior_probs, n_successes, n_trials){
  K <- length(prior_probs)-1 # K+1 values that your p can assume
  n_fails <- n_trials - n_successes
  p <- seq(0, 1, 1/K)
  posterior_probs <- # 1. this computes gamma(i)
    prior_probs * # - pr. for picking coin (prior) (=1/3 in picture)
    p^n_successes * (1-p)^n_fails # - conditional for the flips
  posterior_probs <- posterior_probs/sum(posterior_probs) # 2. normalize gamma(i)
  post_prob <- rbind(p, posterior_probs)
  return(post_prob)
}
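A quick usage sketch (the three-coin numbers here are mine for illustration, not necessarily those from the exercise picture): a uniform prior over 3 coins with p in {0, 0.5, 1}, observing 3 heads in 3 flips.
posterior_distribution(rep(1/3, 3), 3, 3)
# returns a 2-row matrix: the grid (0, 0.5, 1) on top, the posterior (0, 1/9, 8/9) below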
Exercise debrief: point estimates
- Usual suspects for providing point estimate summaries of a PMF p(x)
- Mode: a global optimum of the PMF, $\arg\max_x p(x)$
- In the context of posterior distribution, called the maximum a posteriori (MAP)
- Once we move to continuous models, using MAP is typically suboptimal
- But MAP is OK for discrete models
- Reason for that comes from decision theory
- Mean: $\sum_x x\, p(x)$
- In the context of posterior distribution, called the posterior mean
- The main point estimate used in Bayesian statistics
- $\alpha$-quantiles: $x^\star$ such that $F(x^\star)=\alpha$ (pick, say, the smallest if not unique)
- Posterior quantile
- Very useful e.g. when the posterior has infinite mean
- As we will see later, all these arise from decision theory/the Bayes estimator under different losses
posterior_mean <- function(posterior_probs){
  return(sum(posterior_probs[1,] * posterior_probs[2,]))
}
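The exercise code above covers only the posterior mean; here is an additional sketch (the helper name posterior_quantile is mine, not from the exercise) computing an α-quantile from the same 2-row representation, using the common discrete convention of taking the smallest grid point whose CDF reaches α:
posterior_quantile <- function(alpha, posterior_probs) {
  cumulative_probs <- cumsum(posterior_probs[2, ]) # CDF along the grid
  index <- which.max(cumulative_probs >= alpha) # smallest grid point with F(x) >= alpha
  return(posterior_probs[1, index])
}
# e.g. posterior_quantile(0.5, post) gives the posterior median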
Exercise debrief: construction of HDIs
- You are given a level, typically 90% or 95% (in the code below, the level is written 1−α, with α=0.1)
- Credible interval: a set T such that $\sum_{x\in T} p(x) \approx 1-\alpha$
- High Density “Intervals”: a type of credible interval where we pick a solution with the least number of points $|T|$: $\operatorname{argmin}\{|T| : \sum_{x\in T} p(x) \ge 1-\alpha\}$

- Solution: greedy algorithm.
- Include spikes in the PMF sequentially.
- At each stage, pick the largest one you haven’t picked yet.
- Until the sum of the spike heights exceeds 0.9
high_density_intervals <- function(alpha, posterior_probs){
  ordered_probs = posterior_probs[,order(posterior_probs[2,], decreasing = TRUE)]
  cumulative_probs = cumsum(ordered_probs[2,])
  index = which.max(cumulative_probs >= (1-alpha))
  return(ordered_probs[,1:index, drop=FALSE])
}
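A usage sketch (reusing posterior_distribution() above; the numbers 7 and 10 are arbitrary examples of mine): the probabilities of the included spikes sum to just over 1−α, which is the point of the footnote below.
K <- 1000
post <- posterior_distribution(rep(1/(K+1), K+1), 7, 10) # e.g. 7 successes in 10 launches, uniform prior
hdi <- high_density_intervals(0.1, post)
sum(hdi[2, ]) # slightly above 0.9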
- Footnote:
- because of the discrete nature of the optimization, you will find that the sum of the spikes in the solution T∗ is not exactly equal to 0.9.
- However for fine discretization it will be very close to 0.9.
- In fact, it would be possible to modify the interval a tiny bit at random so that its expected sum of spikes is exactly 0.9.
- TL;DR: This footnote problem will disappear anyways with continuous distributions so ignore it.
Exercise debrief: calibration of credible intervals
- Recall: notion of prediction calibration, “being right 90% of the time when you say you are 90% sure”
- Today: credible interval calibration
- a calibrated 90% credible interval should contain the true parameter 90% of the time
- Let’s see if this is true using simulated data!
- Let’s check on 1000 small data (10 rocket launches)
- Let’s check on 1000 medium data (100 rocket launches)
- Let’s check on 1000 large data (1000 rocket launches)
- What do you expect?
set.seed(1)
K <- 1000
prior_used_for_computing_posterior <- rep(1/(K+1), K+1) # Use same prior for simulation and posterior computation
n_repeats <- 1000
alpha <- 0.1
hdi_coverage_pr <- function(n_datapoints) {
  n_inclusions <- 0
  for (repetition in seq(1:n_repeats)) {
    i <- rdunif(K + 1) - 1 # Always generate the data using a uniform prior
    true_p <- i/K
    x <- rbinom(1, n_datapoints, true_p)
    post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
    # This if is just a hacky way to check if true parameter is in the HDI credible interval
    if (sum(abs(true_p - high_density_intervals(alpha, post)[1,]) < 10e-10) == 1) {
      n_inclusions <- n_inclusions + 1
    }
  }
  return(n_inclusions/n_repeats) # Fraction of simulation where the true parameter was in interval
}
df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)
ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

- Surprise!: in a Bayesian well-specified context, calibration holds for all dataset sizes!
- Contrast with typical frequentist intervals, usually calibrated only for large enough dataset size
- This is because frequentist intervals typically rely on asymptotics (Central Limit Theorem) to be computable
- Bonus 1: this is a great way to test correctness of your software implementation.
- Correct Bayesian inference code: calibrated.
- Buggy Bayesian inference code: almost never calibrated.
- It is nice because you can run your test on small dataset where inference is fast.
- Pick the test just big enough to get code coverage.
- Bonus 2: could you use this to perform model “goodness-of-fit”? How?
- Harness the fact that for a Bayesian, prediction and parameter estimation are done in the same framework
- First, let us pick a simple setup, say iid univariate observations y=(y1,y2,…,yn)
- Let y−i denote all the observations excluding the i-th one, y−i=(y1,y2,…,yi−1,yi+1,…,yn)
- Idea: if the model is well-specified, P(Yi∈Cα(Y−i))=α
- Approximate the LHS using leave-one-out: $P(Y_i \in C_\alpha(Y_{-i})) \approx \frac{1}{n}\sum_i I[y_i \in C_\alpha(y_{-i})]$, where $I[\cdot]$ denotes the indicator of an event, i.e. a function that is 1 if the outcome is in the event, 0 otherwise (a small conjugate-model sketch of this check appears after this list)
- Special case: cross-validated probability integral transform (PIT), see this paper for practical methods to compute it using importance sampling and other tricks, which uses simple credible intervals of the form $C_\alpha = (-\infty, x]$ (my hunch: it might be better to use HDI; makes no difference if the model is well-specified, but it might in the misspecified case, which matters since we are doing goodness-of-fit!)
- PIT is related to the more commonly used but less formal “posterior predictive check”, where data point i is not excluded
- easier to compute (only one posterior instead of n posteriors)
- no theoretical guarantees and hence harder to interpret
- useful as a quick and dirty visualization trick
- Summary:
- People often use “synthetic data” as follows:
- Generate data, hold out the true parameter θ⋆
- Report distance between inferred θ and θ⋆
- Now how to interpret this distance? Could be unclear.
- Message here: there are other ways to use synthetic data in a Bayesian context!
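As promised above, a minimal sketch of the leave-one-out coverage check, using a conjugate normal model so the leave-one-out posterior predictive is available in closed form (the model, sample size, and variable names are my choices, not from the slides):
set.seed(3)
n <- 200
theta <- rnorm(1) # well-specified simulation: theta ~ N(0, 1)
y <- rnorm(n, mean = theta, sd = 1) # y_i | theta ~ N(theta, 1), iid
level <- 0.9
covered <- sapply(1:n, function(i) {
  # Conjugacy: posterior given y_{-i} is N(sum(y[-i]) / n, 1/n), since prior precision 1 + data precision (n - 1) = n
  post_mean <- sum(y[-i]) / n
  pred_sd <- sqrt(1/n + 1) # leave-one-out posterior predictive for Y_i: N(post_mean, 1/n + 1)
  abs(y[i] - post_mean) <= qnorm((1 + level) / 2) * pred_sd # is y_i in the central 90% predictive interval?
})
mean(covered) # should be close to 0.9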
Why well-specified credible intervals are calibrated for any fixed dataset sizes

- Denote the data by X
- Parameter of interest: I
- Denote the credible interval by C(X)
- We want to show P(I∈C(X))=0.9
- By construction (see also earlier footnote for detail), the interval satisfies, for all observed data x, P(I∈C(x)|X=x)=0.9
- Recall the law of total probability:
- for any partition Ei of the sample space,
- we have P(A)=∑iP(A,Ei)
- Here: A=(I∈C(X)), Ei=(X=xi)
- where {xi} denotes the possible values the data can potentially take across all possible simulated experiments.
- Putting it all together and applying the chain rule: $P(I\in C(X)) = \sum_i P(I\in C(X), X=x_i) = \sum_i P(X=x_i)\,P(I\in C(x_i)\mid X=x_i) = \sum_i P(X=x_i)\cdot 0.9 = 0.9 \sum_i P(X=x_i) = 0.9$.
- Note: this is a nice way to write the law of total expectation, and a direct consequence of the technical definition of conditional expectation: $P(A\mid X)$ should be viewed as a random variable which is a function of $X$, $P(A\mid X)=f(X)$ for some $f$, so $P(A)=E[P(A\mid X)]$
Calibration of credible intervals (misspecified case)
Now suppose we are using the “wrong prior”, i.e. data generation uses the uniform prior but we base our posterior computation on the non-uniform prior from earlier
Again, let’s do it for small (10 launches), medium (100), and large datasets (1000), plotting the nominal coverage (dashed) against the actual coverage (solid line)
# Using now the same non-uniform prior as before for posterior calculation
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
set.seed(1)
K <- 1000
n_repeats <- 1000
alpha <- 0.1
df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)
ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

- Bad news: for small datasets we are no longer calibrated, in the worst possible way
- Higher than the dashed line would have meant inference is conservative, i.e. right more often than it claims. That’s not too bad.
- Lower than the dashed line: we are overconfident or anti-conservative, which in some cases can be reckless
- Good news: this gets corrected quickly as the dataset gets larger. Why?
Asymptotic calibration under certain misspecification
- There is no general calibration theory for misspecified models, only special cases (outside of that, use simulation studies or cross-validation techniques)
- Setup in which we have a theorem: graphical model of the following form

Bernstein-von Mises: under certain conditions, even when the prior is misspecified the actual coverage of credible intervals converges to the nominal coverage. (converge just means “gets arbitrarily close”)
Conditions include, but are not limited to:
- θ∈Θ needs to live in a continuous rather than discrete space! I.e.: Θ⊂Rd
- Intuition: Bernstein-von Mises actually proves something stronger: convergence of the rescaled, centered posterior to a normal distribution, which is a continuous distribution (can you guess the scaling?)
- “Bayesian Central Limit Theorem”
- p(θ) (the prior) is now a density instead of a PMF,
- we assume it is non-zero in a neighborhood of the true parameter
- we assume the posterior mean is consistent
- recall: that means the error goes to zero as number of data points increases
- we assume the model is parametric:
- θ is allowed to be a vector, but you are not allowed to make it bigger as the number of observations increases
We will explore these conditions in more detail.
Survey: who is familiar with asymptotic normality of the MLE?
Practical consequence:
- for large datasets, and under certain conditions, the prior is “washed away” by the likelihood
- The conditions are not crazy and hold in some situations of interest
- But there are also simple practical situations where the conditions do not hold (e.g. prior assigning zero mass to important regions)
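A small sketch of this “washing away” using the discretized rocket model and the helpers from the exercise debrief (the true p = 0.7 and the launch count are my choices; note that Bernstein-von Mises itself is stated for continuous parameters, so this is only an illustration):
K <- 1000
p_grid <- seq(0, 1, 1/K)
uniform_prior <- rep(1/(K+1), K+1)
nonuniform_prior <- dnorm(p_grid, mean = 0.2, sd = 0.2)
nonuniform_prior <- nonuniform_prior / sum(nonuniform_prior)
x <- rbinom(1, 1000, 0.7) # 1000 launches, true p = 0.7, far from the non-uniform prior's mode
posterior_mean(posterior_distribution(uniform_prior, x, 1000))
posterior_mean(posterior_distribution(nonuniform_prior, x, 1000)) # nearly identical for this large dataset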
Continuous random variables: motivation
- Example: p∈[0,1] instead of p∈{0,1/K,2/K,…,1}
- Motivations to having continuous random variables:
- Discretization becomes exponentially expensive as the dimension grows
- We want Bernstein-von Mises to kick in
- Annoying to select a discretization size (K in our rocket example)
Continuous random variables: first implementation
- Question: how to store the posterior distribution of a continuous random variable in a computer?
- With a discrete random variable: we could do it with an array (of size K+1 in our rocket example)
- For continuous random variables: use a density (first answer)
Recall: how to use a density to compute a probability?
- Say a random variable X has density f
- I want to know the probability that X is between say 0.1 and 0.3
- Answer:

$P(X\in[0.1,0.3]) = \int_{0.1}^{0.3} f(x)\,dx$
- At this point, you should have tears in your eyes: integrals are nasty
- No closed form in general
- Even if we pick nice prior with closed form integrals, posterior integrals might be nasty
- Recall: in the rocket example, we observed that even if we start with a uniform prior, the posterior is not uniform
- If X=(X1,X2,…,Xd) is a random vector (stacked random variables), we need d iterated integrals
- Numerical approximation (trapezoid, etc) does not scale to models having many dependent random variables
Monte Carlo as a black box
Motivation:
- we want to use continuous random variables without having to compute integrals (no closed form in general, and for >10 dimensions numerical integration is too slow)
- we want combinatorial objects without hard sums
However, for simplicity, let us start by assuming the model has only 1 real valued unknown parameter.
Idea:
- Store a set S of random points called Monte Carlo samples,
- chosen such that probabilities can be approximated by:
$P(X\in[0.1,0.3]) \approx \frac{\#\{\text{points in } S \text{ that fall in } [0.1,0.3]\}}{M}$, where $M$ is the number of Monte Carlo samples
Intuition. Let’s say you want to compute the area of this weird curved shape A:

Calculus: need the following double integral over the weird shape $A$: $\iint_A dx\,dy$
- Monte Carlo method:
- Simulate random points in the square S={X1,X2,…,XM}
- Return the fraction of the points that fall inside the weird shape A
Note: all we need to do for each point Xi is to check if the point falls in the region or not. This is called pointwise evaluation. No need to compute complicated integrals.
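The weird shape from the figure is not reproducible here, but the same recipe works for a shape whose area we know (a disk of radius 1, area π; the shape choice is mine):
set.seed(4)
M <- 1000000
x <- runif(M, -1, 1) # M random points in the square [-1, 1] x [-1, 1], which has area 4
y <- runif(M, -1, 1)
inside <- x^2 + y^2 <= 1 # pointwise evaluation: does each point fall in the disk?
4 * mean(inside) # fraction inside, times the square's area: close to pi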
How to create Monte Carlo samples S
Guiding principle: we have two ways to compute probabilities: densities and Monte Carlo. We want that as the size of S increases, the discrepancy between the two methods becomes arbitrarily close to zero
- Mathematically: we want that as $M\to\infty$, $\frac{\#\{\text{points in } S \text{ that fall in } [0.1,0.3]\}}{M} \to \int_{0.1}^{0.3} f(x)\,dx$ with “overwhelming probability”.
- This property is called a “Law of Large Numbers”
- There are several ways to get a “Law of Large Numbers” (a fact not emphasized enough!)
- Independent and identically distributed random variables Xi (the case you may be familiar with, but not the only one!)
- The list of states visited by a Markov chain whose long-run behaviour has density f (called the stationary density)
- But for now you can still treat S as magically provided by a PPL black box
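Here is a sketch of this guiding principle in a case where both sides are computable (the Beta(2, 5) density is an arbitrary stand-in of mine, and this S is iid rather than the output of a Markov chain):
M <- 100000
S <- rbeta(M, 2, 5) # Monte Carlo samples whose density f is dbeta(., 2, 5)
mean(S >= 0.1 & S <= 0.3) # Monte Carlo estimate of P(X in [0.1, 0.3])
pbeta(0.3, 2, 5) - pbeta(0.1, 2, 5) # the integral of f over [0.1, 0.3]; the two agree as M grows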
Recap: computing a mean

- Calculus way:
$E[X] = \int x f(x)\,dx$
- Monte Carlo way (much easier to implement):
$E[X] \approx \frac{1}{M}\sum_{i=1}^{M} X_i$
Again, Laws of Large Numbers link the two
In our weird shape example, separately average the positions for each dimension over the points that fall in the shape.
For the variance, use $\mathrm{Var}[X] = E[X^2] - (E[X])^2$: we have the second term already, and for the first term
$E[X^2] \approx \frac{1}{M}\sum_{i=1}^{M} X_i^2$
Makes sense: think about it as defining a new random variable Y=X2 and computing the mean of that transformed random variable…
Generalizing this trick: if I give you a nice enough function g such as g(x)=x2,
$E[g(X)] \approx \frac{1}{M}\sum_{i=1}^{M} g(X_i)$
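Continuing the Beta(2, 5) stand-in from the earlier sketch (S as above; the exact values in the comments are the standard Beta moments):
mean(S) # approximates E[X] (= 2/7 for Beta(2, 5))
mean(S^2) - mean(S)^2 # approximates Var[X] via the identity above (= 10/392 here)
g <- function(x) x^2 # any nice enough function g
mean(g(S)) # the general recipe: average g over the Monte Carlo samples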
More asymptotic regimes
We have done two thought experiments so far:
- Letting the number of observations (rocket launches) go to infinity, N→∞
- Data collection budget
- Often no control on that
- Letting the number of Monte Carlo samples go to infinity, M→∞
- Computational budget
- If not happy with results, easy to increase

- Pop quiz: is the picture on the bottom right always a point mass?
Note:
- for a “nice” model such as this one, as N→∞ the posterior gets arbitrarily peaky (“converges to a point mass”).
- Conditions will be stated next week.
- But there are less nice models where we do not get convergence to a point mass as the data size goes to infinity.
- For example, recall the behaviour for unidentifiable models

- These will often be harder to sample in practice
- People often think about “multi-modal” problems to motivate advanced MCMC
- Multi-modality does arise and cause problems, but in my experience, the more serious practical issues come from unidentifiable or weakly identifiable models
De Finetti: intuition and motivation
- We have invoked the “exchangeable/iid” setup several times already
- De Finetti theorem will give us more insight in this class of models
- Motivation:


- Often, you may not want to rely on the order of the rows in the spreadsheet given to you. Just shuffle it to be safe.
- In such circumstances, de Finetti says that there exists a well-specified model with the graphical model on the right
- I.e. that there exists a prior such that the data is iid given that prior
- This motivation gives you an idea of the theorem, but it is not exactly what de Finetti says, so let’s formalize the theorem!
Exchangeable random variables
- Recall: notion of equality in distribution, $X_1 \overset{d}{=} X_2$
- …this means the distribution of X1 is equal to the distribution of X2
- …concretely: draw the marginal PMF or density of each random variable, check if they match
- Example where $X_1 \overset{d}{=} X_2$ but $X_1 \neq X_2$?
- Highlight to reveal the answer: You flip a coin and say what is on the top, $X_1$. Your friend says what is on the bottom, $X_2$ (simulated in the small sketch at the end of this list)
- Two random variables are exchangeable if $(X_1, X_2) \overset{d}{=} (X_2, X_1)$
- Example: indicator variable (i.e. Bernoulli random variables).
- Develop a criterion for checking from the joint density if it’s exchangeable.
- Highlight to reveal the answer: It should be symmetric with respect to the line y=x
- Show that the indicator densities on a square, a circle, and a checkerboard centered at zero lead to exchangeable random variables. Which one is iid?
- Highlight to reveal the answer: Only the square is iid
- Note: exchangeability implies identical distributions
- Note: this notion is closely related to reversibility
- Generalization: $n$ variables are exchangeable if for all permutations $\sigma \in S_n$ (bijections $\sigma:\{1,2,\ldots,n\}\to\{1,2,\ldots,n\}$), $(X_1,X_2,\ldots,X_n) \overset{d}{=} (X_{\sigma(1)},X_{\sigma(2)},\ldots,X_{\sigma(n)})$
- Infinite exchangeability: an infinite sequence of random variables $(X_1,X_2,\ldots)$ is exchangeable if all finite subsets are exchangeable.
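A tiny simulation of the coin example above, just as an illustration: X1 is the face showing on top, X2 the face on the bottom; the marginals match and the pair is exchangeable, yet X1 is never equal to X2.
set.seed(5)
M <- 100000
X1 <- rbinom(M, 1, 0.5) # X1 = 1 if heads is on top
X2 <- 1 - X1 # the bottom face is always the opposite
c(mean(X1), mean(X2)) # same marginal distribution: both approximately Bernoulli(0.5)
mean(X1 == X2) # yet X1 is never equal to X2
table(X1, X2) / M # the joint is symmetric in its two arguments: (X1, X2) =d (X2, X1)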