STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/4/2019
Goal today
- Consistency, continued
- Exercise debrief
- Calibration, continued
- Bernstein-von Mises
Recall: motivation
Typical statistics course: a bunch of recipes.
- Here is a method for clustering. Here is one for classification. Here is one for recommender systems. Etc.
- Every problem is a new problem.
- If it’s not in the textbook, the frequentist might need a PhD student to spend 5 years on the problem.
Bayesian statistics: a philosophy applicable to pretty much any data analysis problem
- I am not saying you should always use it, but you almost always can, and it is often your only option if you want to use all the information at hand within a realistic modelling time frame
Recall: some imperfect but useful terminology
- One graphical model that occurs a lot is this one (sometimes called an “exchangeable” or “iid” setup, for reasons to be discussed later)

- In this setup, a convention is:
- to call the distribution on θ the “prior”
- to call the conditional distribution of the observations given the parameter θ the “likelihood”
- However keep in mind the Bayesian paradigm applies to any graphical models, not just this one.
- In some other graphical models, it may be awkward to decide what is the “prior” and what is the “likelihood”.
- But to perform Bayesian inference, all we need is a joint distribution at the end of the day.
- However, many theoretical results do assume the exchangeable setup
Recall: Accuracy of point estimate as a function of number of observations?
- You are asked for one more significant digit in your point estimate,
- how many additional observations should you collect?
- Let’s use some code based on synthetic datasets (similar idea to calibration experiment from last time)
set.seed(1)
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max
K = 1000
# Later, we will use different prior to generate data versus do inference,
# but for now, use the same (uniform on the K+1 grid points)
prior_used_for_computing_posterior <- rep(1/(K+1), K+1)
simulate_posterior_mean_error <- function(n_datapoints) {
  i <- rdunif(K + 1) - 1 # Always generate the data using a uniform prior
  true_p <- i/K
  x <- rbinom(1, n_datapoints, true_p) # Recall: a Binomial is just a sum of iid Bernoullis
  post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
  post_mean <- posterior_mean(post)
  return(abs(post_mean - true_p)) # Return the error by comparing posterior mean and p used to generate data
}
Recall: Accuracy of point estimate, empirical results
- Calling our new function to get errors for:
- 1000 datasets each of size 10 (= 10^1)
- 1000 datasets each of size 32 (= 10^(3/2))
- …
set.seed(1)
library(ggplot2)
df <- data.frame("n_observations" = rep(c(10, 32, 100, 316, 1000), each=1000))
df$error_well_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
ggplot(data=df, aes(x=n_observations, y=error_well_specified+1e-5)) +
  stat_summary(fun.y = mean, geom="line") + # Line averages over 1000 datasets
  scale_x_log10() + # Show result in log-log scale
  scale_y_log10() +
  geom_point()

Recall: Accuracy of point estimate as a function of number of observations
Fitting a line on this log-log plot, $\underbrace{\log_{10}(\text{error})}_{y} = a\,\underbrace{\log_{10}(\text{number of observations})}_{x} + b$, the slope comes out at $a \approx -1/2$,
so we get $\text{error} = O\big((\text{number of observations})^{-1/2}\big)$
Similarly to the frequentist setting, for many models the root mean squared error scales like $\frac{\text{constant}}{\sqrt{\text{number of observations}}}$
- So if we want to reduce it by factor 10 (i.e. “gain one significant digit”), need 100x more observations
- To gain 2 more significant digits: need 10,000x more observations, etc
What we did here is an example of an (empirical) asymptotic analysis
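To make this concrete, here is a small sketch (reusing the df computed above; the least-squares fit is mine, not part of the original code) estimating the slope of the log-log relationship; it should come out close to -1/2:
# Average the errors for each dataset size, then fit a line in log-log space
means <- aggregate(error_well_specified ~ n_observations, data = df, FUN = mean)
fit <- lm(log10(error_well_specified) ~ log10(n_observations), data = means)
coef(fit) # the log10(n_observations) coefficient is roughly -0.5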
Recall: What if we used the “wrong” prior
- Recall the non-uniform prior from the exercise question:
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
plot(seq(0, 1, by=1/K), prior_used_for_computing_posterior)

- The code below will add another curve to the error plot
- In this new curve, the posterior is computed based on the wrong (non-uniform) prior
- But we still simulate the “true” p uniformly, hence the misspecification
set.seed(2)
library(tidyr) # for gather()
df$error_mis_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
df <- gather(data=df, key=prior_type, value=error, error_mis_specified, error_well_specified)
ggplot(data=df, aes(x=n_observations, y=error+1e-5, colour=factor(prior_type))) +
  stat_summary(fun.y = mean, geom="line") +
  scale_x_log10() +
  scale_y_log10() +
  geom_point()

- The misspecified model exhibits slightly lower performance when there are 10 observations,
- but the performance of the two models becomes very close as the number of observations increases
- Do you think this will always be like this for any misspecified models?
Recall: First way to break consistency: violating Cromwell’s rule
- Suppose we put zero prior mass on having a fair die. What happens to the posterior?

If we put prior mass of zero on some event, say $\pi_0(A)=0$, then the posterior probability of that event will always be zero, $\pi_n(A)=0$, no matter how many observations $n$ we get
Cromwell’s rule (Oliver Cromwell, 1650): “I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”
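A minimal sketch of this failure mode, reusing the discretized rocket model and the posterior_distribution() helper defined in the exercise debrief below (the prior constructed here is purely illustrative):
K <- 1000
prior_zero_at_half <- rep(1/K, K + 1)
prior_zero_at_half[K/2 + 1] <- 0 # zero prior mass on the grid point p = 0.5
prior_zero_at_half <- prior_zero_at_half / sum(prior_zero_at_half)
x <- rbinom(1, 1000, 0.5) # but the data really comes from p = 0.5
post <- posterior_distribution(prior_zero_at_half, x, 1000)
post[2, abs(post[1, ] - 0.5) < 1e-10] # posterior probability of p = 0.5: still exactly 0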
Second way to break consistency: Unidentifiable models
When there are too many parameters, you may not be able to recover them even if you had “infinite data”
- Example:
- Let us modify the prior for the rocket example (continuous version), setting $p = p_1 p_2$ where $p_1 \sim \text{Unif}(0,1)$ and $p_2 \sim \text{Unif}(0,1)$.
- Intuition: there are distinct pairs $(p_1, p_2)$ that yield the same product $p = p_1 p_2$, and hence the same likelihood, e.g. $(p_1,p_2)=(0.1,0.2)$ and $(p_1,p_2)=(0.05,0.4)$.
Model code in the Blang PPL
model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}

- Observations:
- The posterior does not become degenerate as the sample size increases
- It is now asymmetric. This is not a bug! Can you figure out why?
This is an example of what is called an unidentifiable model: there are two distinct parameters $\theta_1 \neq \theta_2$ whose likelihoods are exactly the same, $P_{\theta_1} = P_{\theta_2}$.
Note: Paul Gustafson is the expert on Bayesian unidentifiable / weakly identifiable models
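The Blang output is not reproduced here, but a rough grid-based sketch of the same phenomenon (grid size, true values, and variable names are mine) shows the posterior spreading along the curve p1 * p2 = constant rather than concentrating on a point:
grid <- seq(0.01, 0.99, by = 0.01)
n_launches <- 10000
n_fails <- rbinom(1, n_launches, 0.3 * 0.5) # true p1 = 0.3, p2 = 0.5
log_lik <- outer(grid, grid, function(p1, p2) dbinom(n_fails, n_launches, p1 * p2, log = TRUE))
post <- exp(log_lik - max(log_lik)) # uniform prior, so posterior is proportional to the likelihood
post <- post / sum(post)
contour(grid, grid, post, xlab = "p1", ylab = "p2") # mass spread along a curve, even with 10000 observations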
Theoretical justification of consistency: Doob’s consistency theorem
In the exchangeable setup, outside of these two failure modes, we are guaranteed to have consistency. The details of the statement are somewhat delicate and need to be spelled out to interpret the result correctly.
- Frequentist notions of consistency under θ:
- setup:
- a parametric model, {Pθ:θ∈Θ}, defined on space Ω
- observed random variables Xi:Ω→Rd
- estimator Tn(x1,x2,…,xn) for θ
- consistency:
- compose the observation r.v. with the estimator: Tn=Tn(X1,X2,…,Xn)
- ask that Tn→θ as n→∞, under a suitable notion of convergence of random variables…
- either “in Pθ-probability” or “Pθ almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all $\epsilon>0$, $P_\theta(d(T_n,\theta)<\epsilon)\to 1$, for some metric $d$
- almost sure: Pθ(Tn→θ)=1
- frequentists ask that the above holds for all θ
- Bayesian notion of consistency under θ
- setup:
- assume the “exchangeable setup”

- let Πn(⋅)=Πn(⋅|X1,X2,…,Xn) denote the (random) posterior (why random: as for the frequentist case, composition of the observations random variable with the posterior update map)
- ask that Πn(A)→δθ(A) for any A, under a suitable notion of convergence of random variables…
- either “in Pθ-probability” or “Pθ almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all $\epsilon>0$, $P_\theta(|\Pi_n(A)-\delta_\theta(A)|<\epsilon)\to 1$
- almost sure: Pθ(Πn(A)→δθ(A))=1
- Bayesians ask that the above holds for a set of “good” θ’s, denoted G, such that this good set has probability one under the prior π0(G)=1.
Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.
Corollary: if you use the wrong prior, $\tilde\pi_0$, but $\tilde\pi_0(A)>0 \iff \pi_0(A)>0$, then Doob’s consistency still holds for identifiable exchangeable models.
Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, you will get right anything that’s derived from it)
Exercise debrief: Bayes rule

- Strategy (Bayes’ rule):
- Compute joint probabilities γ(i)=P(Ci,H1,H2,H3)
- Use chain rule (multiply edge conditional probabilities)
- Normalize: compute Z=γ(0)+γ(1)+γ(2) and divide the vector γ by Z
$\pi(i) = \frac{\gamma(i)}{\sum_{j=0}^{K}\gamma(j)}$
posterior_distribution <- function(prior_probs, n_successes, n_trials){
  K <- length(prior_probs)-1 # K+1 values that your p can assume
  n_fails <- n_trials - n_successes
  p <- seq(0, 1, 1/K)
  posterior_probs <- # 1. this computes gamma(i)
    prior_probs * # - pr. for picking coin (prior) (=1/3 in picture)
    p^n_successes * (1-p)^n_fails # - conditional for the flips
  posterior_probs <- posterior_probs/sum(posterior_probs) # 2. normalize gamma(i)
  post_prob <- rbind(p, posterior_probs)
  return(post_prob)
}
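A quick usage sketch (the three-coin numbers here are mine for illustration, not necessarily those from the exercise picture): a uniform prior over 3 coins with p in {0, 0.5, 1}, observing 3 heads in 3 flips.
posterior_distribution(rep(1/3, 3), 3, 3)
# returns a 2-row matrix: the grid (0, 0.5, 1) on top, the posterior (0, 1/9, 8/9) below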
Exercise debrief: point estimates
- Usual suspects for providing point estimate summaries of a PMF p(x)
- Mode: a global optimum of the PMF, $\arg\max_x p(x)$
- In the context of posterior distribution, called the maximum a posteriori (MAP)
- Once we move to continuous models, using MAP is typically suboptimal
- But MAP is OK for discrete models
- Reason for that comes from decision theory
- Mean: $\sum_x x\, p(x)$
- In the context of posterior distribution, called the posterior mean
- The main point estimate used in Bayesian statistics
- $\alpha$-quantiles: $x^\star$ such that $F(x^\star)=\alpha$ (pick, say, the smallest if not unique)
- Posterior quantile
- Very useful e.g. when the posterior has infinite mean
- As we will see later, all these arise from decision theory/the Bayes estimator under different losses
posterior_mean <- function(posterior_probs){
  return(sum(posterior_probs[1,] * posterior_probs[2,]))
}
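The exercise code above covers only the posterior mean; here is an additional sketch (the helper name posterior_quantile is mine, not from the exercise) computing an α-quantile from the same 2-row representation, using the common discrete convention of taking the smallest grid point whose CDF reaches α:
posterior_quantile <- function(alpha, posterior_probs) {
  cumulative_probs <- cumsum(posterior_probs[2, ]) # CDF along the grid
  index <- which.max(cumulative_probs >= alpha) # smallest grid point with F(x) >= alpha
  return(posterior_probs[1, index])
}
# e.g. posterior_quantile(0.5, post) gives the posterior median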
Exercise debrief: construction of HDIs
- You are given a level, typically 90% or 95% (in the code below, the level is written 1−α, with α=0.1)
- Credible interval: a set T such that $\sum_{x\in T} p(x) \approx 1-\alpha$
- High Density “Intervals”: a type of credible interval where we pick a solution with the least number of points $|T|$: $\operatorname{argmin}\{|T| : \sum_{x\in T} p(x) \ge 1-\alpha\}$

- Solution: greedy algorithm.
- Include spikes in the PMF sequentially.
- At each stage, pick the largest one you haven’t picked yet.
- Until the sum of the spike heights exceeds 0.9
high_density_intervals <- function(alpha, posterior_probs){
  ordered_probs = posterior_probs[,order(posterior_probs[2,], decreasing = TRUE)]
  cumulative_probs = cumsum(ordered_probs[2,])
  index = which.max(cumulative_probs >= (1-alpha))
  return(ordered_probs[,1:index, drop=FALSE])
}
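A usage sketch (reusing posterior_distribution() above; the numbers 7 and 10 are arbitrary examples of mine): the probabilities of the included spikes sum to just over 1−α, which is the point of the footnote below.
K <- 1000
post <- posterior_distribution(rep(1/(K+1), K+1), 7, 10) # e.g. 7 successes in 10 launches, uniform prior
hdi <- high_density_intervals(0.1, post)
sum(hdi[2, ]) # slightly above 0.9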
- Footnote:
- because of the discrete nature of the optimization, you will find that the sum of the spikes in the solution T∗ is not exactly equal to 0.9.
- However for fine discretization it will be very close to 0.9.
- In fact, it would be possible to modify the interval a tiny bit at random so that its expected sum of spikes is exactly 0.9.
- TL;DR: This footnote problem will disappear anyways with continuous distributions so ignore it.
Exercise debrief: calibration of credible intervals
- Recall: notion of prediction calibration, “being right 90% of the time when you say you are 90% sure”
- Today: credible interval calibration
- a calibrated 90% credible interval should contain the true parameter 90% of the time
- Let’s see if this is true using simulated data!
- Let’s check on 1000 small data (10 rocket launches)
- Let’s check on 1000 medium data (100 rocket launches)
- Let’s check on 1000 large data (1000 rocket launches)
- What do you expect?
set.seed(1)
K <- 1000
prior_used_for_computing_posterior <- rep(1/(K+1), K+1) # Use same prior for simulation and posterior computation
n_repeats <- 1000
alpha <- 0.1
hdi_coverage_pr <- function(n_datapoints) {
  n_inclusions <- 0
  for (repetition in seq(1:n_repeats)) {
    i <- rdunif(K + 1) - 1 # Always generate the data using a uniform prior
    true_p <- i/K
    x <- rbinom(1, n_datapoints, true_p)
    post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
    # This if is just a hacky way to check if true parameter is in the HDI credible interval
    if (sum(abs(true_p - high_density_intervals(alpha, post)[1,]) < 10e-10) == 1) {
      n_inclusions <- n_inclusions + 1
    }
  }
  return(n_inclusions/n_repeats) # Fraction of simulation where the true parameter was in interval
}
df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)
ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

- Surprise!: in a Bayesian well-specified context, calibration holds for all dataset sizes!
- Contrast with typical frequentist intervals, usually calibrated only for large enough dataset size
- This is because frequentist intervals typically rely on asymptotics (Central Limit Theorem) to be computable
- Bonus 1: this is a great way to test correctness of your software implementation.
- Correct Bayesian inference code: calibrated.
- Buggy Bayesian inference code: almost never calibrated.
- It is nice because you can run your test on small dataset where inference is fast.
- Pick the test just big enough to get code coverage.
- Bonus 2: could you use this to perform model “goodness-of-fit”? How?
- Harness the fact that for a Bayesian, prediction and parameter estimation are done in the same framework
- First, let us pick a simple setup, say iid univariate observations y=(y1,y2,…,yn)
- Let y−i denote all the observations excluding the i-th one, y−i=(y1,y2,…,yi−1,yi+1,…,yn)
- Idea: if the model is well-specified, P(Yi∈Cα(Y−i))=α
- Approximate the LHS using leave-one-out: $P(Y_i \in C_\alpha(Y_{-i})) \approx \frac{1}{n}\sum_i I[y_i \in C_\alpha(y_{-i})]$, where $I[\cdot]$ denotes the indicator of an event, i.e. a function that is 1 if the outcome is in the event, 0 otherwise (a small conjugate-model sketch of this check appears after this list)
- Special case: cross-validated probability integral transform (PIT), see this paper for practical methods to compute it using importance sampling and other tricks, which uses simple credible intervals of the form $C_\alpha = (-\infty, x]$ (my hunch: it might be better to use HDI; makes no difference if the model is well-specified, but it might in the misspecified case, which matters since we are doing goodness-of-fit!)
- PIT is related to the more commonly used but less formal “posterior predictive check”, where data point i is not excluded
- easier to compute (only one posterior instead of n posteriors)
- no theoretical guarantees and hence harder to interpret
- useful as a quick and dirty visualization trick
- Summary:
- People often use “synthetic data” as follows:
- Generate data, hold out the true parameter θ⋆
- Report distance between inferred θ and θ⋆
- Now how to interpret this distance? Could be unclear.
- Message here: there are other ways to use synthetic data in a Bayesian context!
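As promised above, a minimal sketch of the leave-one-out coverage check, using a conjugate normal model so the leave-one-out posterior predictive is available in closed form (the model, sample size, and variable names are my choices, not from the slides):
set.seed(3)
n <- 200
theta <- rnorm(1) # well-specified simulation: theta ~ N(0, 1)
y <- rnorm(n, mean = theta, sd = 1) # y_i | theta ~ N(theta, 1), iid
level <- 0.9
covered <- sapply(1:n, function(i) {
  # Conjugacy: posterior given y_{-i} is N(sum(y[-i]) / n, 1/n), since prior precision 1 + data precision (n - 1) = n
  post_mean <- sum(y[-i]) / n
  pred_sd <- sqrt(1/n + 1) # leave-one-out posterior predictive for Y_i: N(post_mean, 1/n + 1)
  abs(y[i] - post_mean) <= qnorm((1 + level) / 2) * pred_sd # is y_i in the central 90% predictive interval?
})
mean(covered) # should be close to 0.9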
Why well-specified credible intervals are calibrated for any fixed dataset sizes

- Denote the data by X
- Parameter of interest: I
- Denote the credible interval by C(X)
- We want to show P(I∈C(X))=0.9
- By construction (see also earlier footnote for detail), the interval satisfies, for all observed data x, P(I∈C(x)|X=x)=0.9
- Recall the law of total probability:
- for any partition Ei of the sample space,
- we have P(A)=∑iP(A,Ei)
- Here: A=(I∈C(X)), Ei=(X=xi)
- where {xi} denotes the possible values the data can potentially take across all possible simulated experiments.
- Putting it all together and applying the chain rule: $P(I\in C(X)) = \sum_i P(I\in C(X), X=x_i) = \sum_i P(X=x_i)\,P(I\in C(x_i)\mid X=x_i) = \sum_i P(X=x_i)\cdot 0.9 = 0.9 \sum_i P(X=x_i) = 0.9$.
- Note: this is a nice way to write the law of total expectation, and a direct consequence of the technical definition of conditional expectation: $P(A\mid X)$ should be viewed as a random variable which is a function of $X$, $P(A\mid X)=f(X)$ for some $f$, so $P(A)=E[P(A\mid X)]$
Calibration of credible intervals (misspecified case)
Now suppose we are using the “wrong prior”, i.e. data generation uses the uniform prior but we base our posterior computation on the non-uniform prior from earlier
Again, let’s do it for small (10 launches), medium (100), and large datasets (1000), plotting the nominal coverage (dashed) against the actual coverage (solid line)
# Using now the same non-uniform prior as before for posterior calculation
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
set.seed(1)
K <- 1000
n_repeats <- 1000
alpha <- 0.1
df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)
ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

- Bad news: for small datasets we are no longer calibrated, in the worst possible way
- Higher than the dashed line would have meant inference is conservative, i.e. right more often than it claims. That’s not too bad.
- Lower than the dashed line: we are overconfident or anti-conservative, which in some cases can be reckless
- Good news: this gets corrected quickly as the dataset gets larger. Why?
Asymptotic calibration under certain misspecification
- There is no general calibration theory for misspecified models, only special cases (outside of that, use simulation studies or cross-validation techniques)
- Setup in which we have a theorem: graphical model of the following form

Bernstein-von Mises: under certain conditions, even when the prior is misspecified the actual coverage of credible intervals converges to the nominal coverage. (converge just means “gets arbitrarily close”)
Conditions include, but are not limited to:
- θ∈Θ needs to live in a continuous rather than discrete space! I.e.: Θ⊂Rd
- Intuition: Bernstein-von Mises actually proves something stronger: convergence of the rescaled, centered posterior to a normal distribution, which is a continuous distribution (can you guess the scaling?)
- “Bayesian Central Limit Theorem”
- p(θ) (the prior) is now a density instead of a PMF,
- we assume it is non-zero in a neighborhood of the true parameter
- we assume the posterior mean is consistent
- recall: that means the error goes to zero as number of data points increases
- we assume the model is parametric:
- θ is allowed to be a vector, but you are not allowed to make it bigger as the number of observations increases
We will explore these conditions in more detail.
Survey: who is familiar with asymptotic normality of the MLE?
Practical consequence:
- for large datasets, and under certain conditions, the prior is “washed away” by the likelihood
- The conditions are not crazy and hold in some situations of interest
- But there are also simple practical situations where the conditions do not hold (e.g. prior assigning zero mass to important regions)
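A small sketch of this “washing away” using the discretized rocket model and the helpers from the exercise debrief (the true p = 0.7 and the launch count are my choices; note that Bernstein-von Mises itself is stated for continuous parameters, so this is only an illustration):
K <- 1000
p_grid <- seq(0, 1, 1/K)
uniform_prior <- rep(1/(K+1), K+1)
nonuniform_prior <- dnorm(p_grid, mean = 0.2, sd = 0.2)
nonuniform_prior <- nonuniform_prior / sum(nonuniform_prior)
x <- rbinom(1, 1000, 0.7) # 1000 launches, true p = 0.7, far from the non-uniform prior's mode
posterior_mean(posterior_distribution(uniform_prior, x, 1000))
posterior_mean(posterior_distribution(nonuniform_prior, x, 1000)) # nearly identical for this large dataset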
Continuous random variables: motivation
- Example: p∈[0,1] instead of p∈{0,1/K,2/K,…,1}
- Motivations to having continuous random variables:
- Discretization becomes exponentially expensive as the dimension grows
- We want Bernstein-von Mises to kick in
- Annoying to select a discretization size (K in our rocket example)
Continuous random variables: first implementation
- Question: how to store the posterior distribution of a continuous random variable in a computer?
- With a discrete random variable: we could do it with an array (of size K+1 in our rocket example)
- For continuous random variables: use a density (first answer)
Recall: how to use a density to compute a probability?
- Say a random variable X has density f
- I want to know the probability that X is between say 0.1 and 0.3
- Answer:

$P(X\in[0.1,0.3]) = \int_{0.1}^{0.3} f(x)\,dx$
- At this point, you should have tears in your eyes: integrals are nasty
- No closed form in general
- Even if we pick nice prior with closed form integrals, posterior integrals might be nasty
- Recall: in the rocket example, we observed that even if we start with a uniform prior, the posterior is not uniform
- If X=(X1,X2,…,Xd) is a random vector (stacked random variables), we need d iterated integrals
- Numerical approximation (trapezoid, etc) does not scale to models having many dependent random variables
Monte Carlo as a black box
Motivation:
- we want to use continuous random variables without having to compute integrals (no closed form in general, and for >10 dimensions numerical integration is too slow)
- we want combinatorial objects without hard sums
However, for simplicity, let us start by assuming the model has only 1 real valued unknown parameter.
Idea:
- Store a set S of random points called Monte Carlo samples,
- chosen such that probabilities can be approximated by:
$P(X\in[0.1,0.3]) \approx \frac{\#\{\text{points in } S \text{ that fall in } [0.1,0.3]\}}{M}$, where $M$ is the number of Monte Carlo samples
Intuition. Let’s say you want to compute the area of this weird curved shape A:

Calculus: need the following double integral over the weird shape $A$: $\iint_A dx\,dy$
- Monte Carlo method:
- Simulate random points in the square S={X1,X2,…,XM}
- Return the fraction of the points that fall inside the weird shape A
Note: all we need to do for each point Xi is to check if the point falls in the region or not. This is called pointwise evaluation. No need to compute complicated integrals.
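The weird shape from the figure is not reproducible here, but the same recipe works for a shape whose area we know (a disk of radius 1, area π; the shape choice is mine):
set.seed(4)
M <- 1000000
x <- runif(M, -1, 1) # M random points in the square [-1, 1] x [-1, 1], which has area 4
y <- runif(M, -1, 1)
inside <- x^2 + y^2 <= 1 # pointwise evaluation: does each point fall in the disk?
4 * mean(inside) # fraction inside, times the square's area: close to pi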
How to create Monte Carlo samples S
Guiding principle: we have two ways to compute probabilities: densities and Monte Carlo. We want that as the size of S increases, the discrepancy between the two methods becomes arbitrarily close to zero
- Mathematically: we want that as $M\to\infty$, $\frac{\#\{\text{points in } S \text{ that fall in } [0.1,0.3]\}}{M} \to \int_{0.1}^{0.3} f(x)\,dx$ with “overwhelming probability”.
- This property is called a “Law of Large Numbers”
- There are several ways to get a “Law of Large Numbers” (a fact not emphasized enough!)
- Independent and identically distributed random variables Xi (the case you may be familiar with, but not the only one!)
- The list of states visited by a Markov chain whose long-run behaviour has density f (called the stationary density)
- But for now you can still treat S as magically provided by a PPL black box
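Here is a sketch of this guiding principle in a case where both sides are computable (the Beta(2, 5) density is an arbitrary stand-in of mine, and this S is iid rather than the output of a Markov chain):
M <- 100000
S <- rbeta(M, 2, 5) # Monte Carlo samples whose density f is dbeta(., 2, 5)
mean(S >= 0.1 & S <= 0.3) # Monte Carlo estimate of P(X in [0.1, 0.3])
pbeta(0.3, 2, 5) - pbeta(0.1, 2, 5) # the integral of f over [0.1, 0.3]; the two agree as M grows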
Recap: computing a mean

- Calculus way:
$E[X] = \int x f(x)\,dx$
- Monte Carlo way (much easier to implement):
$E[X] \approx \frac{1}{M}\sum_{i=1}^{M} X_i$
Again, Laws of Large Numbers link the two
In our weird shape example, separately average the positions for each dimension over the points that fall in the shape.
For the variance, use $\mathrm{Var}[X] = E[X^2] - (E[X])^2$: we have the second term already, and for the first term
$E[X^2] \approx \frac{1}{M}\sum_{i=1}^{M} X_i^2$
Makes sense: think about it as defining a new random variable Y=X2 and computing the mean of that transformed random variable…
Generalizing this trick: if I give you a nice enough function g such as g(x)=x2,
$E[g(X)] \approx \frac{1}{M}\sum_{i=1}^{M} g(X_i)$
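Continuing the Beta(2, 5) stand-in from the earlier sketch (S as above; the exact values in the comments are the standard Beta moments):
mean(S) # approximates E[X] (= 2/7 for Beta(2, 5))
mean(S^2) - mean(S)^2 # approximates Var[X] via the identity above (= 10/392 here)
g <- function(x) x^2 # any nice enough function g
mean(g(S)) # the general recipe: average g over the Monte Carlo samples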
More asymptotic regimes
We have done two thought experiments so far:
- Letting the number of observations (rocket launches) go to infinity, N→∞
- Data collection budget
- Often no control on that
- Letting the number of Monte Carlo samples go to infinity, M→∞
- Computational budget
- If not happy with results, easy to increase

- Pop quiz: is the picture on the bottom right always a point mass?
Note:
- for a “nice” model such as this one, as N→∞ the posterior gets arbitrarily peaky (“converges to a point mass”).
- Conditions will be stated next week.
- But there are less nice models where we do not get convergence to a point mass as the data size goes to infinity.
- For example, recall the behaviour for unidentifiable models

- These will often be harder to sample in practice
- People often think about “multi-modal” problems to motivate advanced MCMC
- Multi-modality does arise and cause problems, but in my experience, the more serious practical issues come from unidentifiable or weakly identifiable models
De Finetti: intuition and motivation
- We have invoked the “exchangeable/iid” setup several times already
- De Finetti theorem will give us more insight in this class of models
- Motivation:


- Often, you may not want to rely on the order of the rows in the spreadsheet given to you. Just shuffle it to be safe.
- In such circumstances, de Finetti says that there exists a well-specified model with the graphical model on the right
- I.e. that there exists a prior such that the data is iid given that prior
- This motivation gives you an idea of the theorem, but it is not exactly what de Finetti says, so let’s formalize the theorem!
Exchangeable random variables
- Recall: notion of equality in distribution, $X_1 \overset{d}{=} X_2$
- …this means the distribution of X1 is equal to the distribution of X2
- …concretely: draw the marginal PMF or density of each random variable, check if they match
- Example where $X_1 \overset{d}{=} X_2$ but $X_1 \neq X_2$?
- Highlight to reveal the answer: You flip a coin and say what is on the top, $X_1$. Your friend says what is on the bottom, $X_2$ (simulated in the small sketch at the end of this list)
- Two random variables are exchangeable if $(X_1, X_2) \overset{d}{=} (X_2, X_1)$
- Example: indicator variable (i.e. Bernoulli random variables).
- Develop a criterion for checking from the joint density if it’s exchangeable.
- Highlight to reveal the answer: It should be symmetric with respect to the line y=x
- Show that the indicator densities on a square, a circle, and a checkerboard centered at zero lead to exchangeable random variables. Which one is iid?
- Highlight to reveal the answer: Only the square is iid
- Note: exchangeability implies identical distributions
- Note: this notion is closely related to reversibility
- Generalization: $n$ variables are exchangeable if for all permutations $\sigma \in S_n$ (bijections $\sigma:\{1,2,\ldots,n\}\to\{1,2,\ldots,n\}$), $(X_1,X_2,\ldots,X_n) \overset{d}{=} (X_{\sigma(1)},X_{\sigma(2)},\ldots,X_{\sigma(n)})$
- Infinite exchangeability: an infinite sequence of random variables $(X_1,X_2,\ldots)$ is exchangeable if all finite subsets are exchangeable.
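A tiny simulation of the coin example above, just as an illustration: X1 is the face showing on top, X2 the face on the bottom; the marginals match and the pair is exchangeable, yet X1 is never equal to X2.
set.seed(5)
M <- 100000
X1 <- rbinom(M, 1, 0.5) # X1 = 1 if heads is on top
X2 <- 1 - X1 # the bottom face is always the opposite
c(mean(X1), mean(X2)) # same marginal distribution: both approximately Bernoulli(0.5)
mean(X1 == X2) # yet X1 is never equal to X2
table(X1, X2) / M # the joint is symmetric in its two arguments: (X1, X2) =d (X2, X1)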