STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
2/27/2019
Goal today
- Prediction and the “do not plug in” principle
- Calibration
- Graphical models
- Bayesian asymptotics 101
Bayesian statistics: short historical overview
- Precursors:
- Thomas Bayes (1702–1761); first special case of Bayes rule, published posthumously in 1763
- Pierre-Simon Laplace (1749–1827): generalizations, more applications
- Idea temporarily buried by frequentists in the 1920s
- Second wave: theoretical foundations
- Bruno de Finetti, 1930: partial justification for exchangeable data
- Stein paradox and crisis in frequentist statistics (1955)
- Objective-Subjective Bayes ‘divide’
- Popularization of Subjective Bayes by Leonard Savage in the ’50s
- Inception of the Objective Bayes school: Harold Jeffreys (1939), further development by José-Miguel Bernardo (1979)
- Third wave: coming of age as a versatile data analysis tool
- Inception: Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller and Edward Teller (1953), Metropolis-Hastings algorithm in the physics literature
- Introduction of MCMC to Bayesian statistics: Stuart Geman and Donald Geman (1984), via the Gibbs sampler
- First major PPL: WinBUGS, David Lunn, Andrew Thomas, Nicky Best, David Spiegelhalter (2000)
Recap: toy model

- Step 1: Pick one of the \(K+1\) coins from the bucket
- \(\leftrightarrow\) Design a new rocket
- Step 2: Repeatedly flip the same coin
- \(\leftrightarrow\) Build a bunch of copies of the rocket, launch them and enjoy the show
Recap: decision tree, random variable and PMF all in one picture

- Random variable: maps an outcome (leaf in decision tree) to a label or “realization”
- Here: the index \(I\) of the coin is the plotted random variable
- Realization is \(i \in \{0, 1, 2\}\)
- Height of each green bar in the PMF?
- Add up the probability of each path
- Probability of each path is the product of conditionals along the path
- PMF shown here with the x and y axes flipped to make it easier to draw
Posterior PMF
- Visualization of Bayes rule:
- Kill paths incompatible with observation
- The killing removes mass on some of the spikes
- Renormalize the PMF

- Note:
- We start with a prior PMF in the uniform family
- The posterior PMF is not in the uniform family
- Optional exercise + office hour:
- Write a formula for this idea (killing-renormalization)
- Show it corresponds to Bayes rule in a special case
- Everything works the same way with densities, including the generalized notion of density (Radon-Nikodym derivative)
Recap: probabilist’s notation
- Notation to select subset of leaves
- For example: \((I < 2)\) shown in red below

Mathematically: \((I < 2) = \{s : I(s) < 2\}\)
Recap: declaration of conditionals
\[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_n | I &\sim {\text{Bern}}(I/K); n \in \{1, 2, 3\}
\end{align}\]

Recall:
- Interpretation of \(I\): the index of the coin I pick; \(X_n\): the \(n\)-th flip
- The notation \(X_n | I \sim {\text{Bern}}(I/K)\) means…
- The conditional PMF of the random variable \(X_n\) given \(I\) is \({\text{Bern}}(I/K)\)
- I.e. for any \(x \in {\mathbf{R}}\) and \(i \in \{0, 1, 2, \dots, K\}\), the following conditional probability is obtained by looking up the picture on the right
- Or, mathematically
\[\begin{align}{\mathbb{P}}(X_n = x | I = i) =
\begin{cases}
i/K, & \text{if } x = 1 \\
1 - i/K, & \text{if } x = 0 \\
0, & \text{otherwise}
\end{cases}
\end{align}\]
- We will also make use of the continuous version of this model, “letting \(K \to \infty\)”,
\[\begin{align}
p &\sim {\text{Unif}}(0, 1) \\
X_n | p &\sim {\text{Bern}}(p); n \in \{1, 2, 3\}
\end{align}\]
More on decision trees: drawing your own

Recall we computed each \(\gamma(i)\) by multiplying conditional probabilities on the edges of a path in a decision tree. I gave you the decision tree; how do you draw your own from scratch?
Idea:
- write down the model as we did before \[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_n | I &\sim {\text{Bern}}(I/K); n \in \{1, 2, 3\}
\end{align}\]
- Each chunk of the form \(Y | Z\) tells you that you should make the decisions for random variable \(Y\) after those for random variable \(Z\).
Rationale: mathematically, chain rule works in any order (\({\mathbb{P}}(A, B) = {\mathbb{P}}(A){\mathbb{P}}(B|A) = {\mathbb{P}}(B){\mathbb{P}}(A|B)\)), but here we want to pick an order such that all the conditional distributions are known (part of the model specification)
Example: for the coin bucket model, this tells us we should draw \(I\) first, then the \(X_n\)’s (and that the order among the \(X\)’s does not matter).
Large decision trees: graphical models

Question: how to do this for a large model? For example: \[\begin{align*}
X_1 &\sim {\text{Unif}}(0, 1/2, 1) \\
X_2 | X_1 &\sim {\text{Bern}}(X_1) \\
X_3 | X_1, X_4 &\sim {\text{Unif}}\{1, X_1, X_4\} \\
X_4 &\sim {\text{Bern}}(1/3) \\
X_5 | X_3 &\sim {\text{Unif}}\{0, .., X_3\} \\
X_7 &\sim {\text{Bern}}(1/4) \\
X_6 | X_4, X_7 &\sim {\text{Unif}}\{X_4, X_7\} \\
X_8 &\sim {\text{Bern}}(1/4)
\end{align*}\]
Idea:
- Build a graph, where nodes are random variables and edges are dependencies coming from the given conditional distributions. If you see \(X | Y\), add an edge \(Y \to X\).
- Use an ordering of the nodes that respects the directionality of all the edges (a “topological order”; see the sketch below).
Terminology: this graph is called a (directed) graphical model or Bayes net
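A minimal sketch of this recipe in base R (my own illustration, not from the lecture code; it assumes the graph is acyclic): encode the parents of each node, as read off the conditionals above, then repeatedly peel off a node whose parents have all been ordered already.
parents <- list(
  X1 = character(0), X2 = "X1", X3 = c("X1", "X4"), X4 = character(0),
  X5 = "X3", X6 = c("X4", "X7"), X7 = character(0), X8 = character(0)
)
topological_order <- function(parents) {
  ordering <- character(0)
  remaining <- names(parents)
  while (length(remaining) > 0) {
    # a node is "ready" when all of its parents have already been ordered
    ready <- remaining[sapply(remaining, function(v) all(parents[[v]] %in% ordering))]
    ordering <- c(ordering, ready[1])
    remaining <- setdiff(remaining, ready[1])
  }
  return(ordering)
}
topological_order(parents) # "X1" "X2" "X4" "X3" "X5" "X7" "X6" "X8"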
Generating synthetic datasets
- Graphical models also provide a way to generate synthetic datasets.
- Simply follow the same ordering as discussed in the previous slide
- but this time sample realizations from each (conditional) distribution (see the sketch below).
- Much cheaper than decision trees! Linear complexity instead of exponential.
- Useful for:
- benchmarking, as the synthetic dataset thus created contains both the observations and the “unknowns”
- defining important concepts such as calibration and optimality (soon)
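Concretely, for the coin bucket model, ancestral (forward) sampling takes a couple of lines; a minimal sketch with \(K = 2\) as in the running example:
set.seed(1)
K <- 2
i <- sample(0:K, 1)      # I ~ Unif{0, 1, ..., K}: sample the root first
x <- rbinom(3, 1, i / K) # X_n | I ~ Bern(I/K), n = 1, 2, 3: then its children
# 'i' plays the role of the "unknown", 'x' of the observations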
Some imperfect but useful terminology
- One graphical model that occurs a lot is this one (sometimes called an “exchangeable” or “iid” setup, for reasons to be discussed later)

- In this setup, a convention is:
- to call the distribution on \(\theta\) the “prior”
- to call the conditional distribution of the observations given \(\theta\) the “likelihood”
- However keep in mind the Bayesian paradigm applies to any graphical model, not just this one.
- In some other graphical models, it may be awkward to decide what is the “prior” and what is the “likelihood”.
- But to perform Bayesian inference, all we need is a joint distribution at the end of the day.
- However, many theoretical results do assume the exchangeable setup
Prediction

Question: given you observe the same data (3 Heads in a row), what is your belief that the next draw is again a Heads?
\[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_n | I &\sim {\text{Bern}}(I/K); n \in \{1, 2, 3\}
\end{align}\]
Prediction: the right approach

- Add a random variable for the quantity you want to predict (next flip \(N\)).
- Add one level in the decision tree.
- Use Bayes rule.
- Twist: distinct paths lead to the same prediction
- Sum the probabilities of the paths you collapse (why can we do this?).
- Two ways to get \((H_{1:3}, N = \text{H again})\): \(\gamma(1) = 1/48 + 1/3\)
- Only one way to get \((H_{1:3}, N = \text{T})\): \(\gamma(0) = 1/48\)
- As before, normalize \(\gamma\) to get \(\pi\): \[{\mathbb{P}}(N = \text{H again}|H_{1:3}) = \frac{\gamma(1)}{\gamma(0) + \gamma(1)} = 17/18\]
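As a sanity check, the same computation done numerically (with \(K = 2\), so the possible coin biases are 0, 1/2, 1):
p <- c(0, 1/2, 1)                     # the three possible coin biases
gamma_H <- sum((1/3) * p^3 * p)       # paths ending in (HHH, H again): 17/48
gamma_T <- sum((1/3) * p^3 * (1 - p)) # paths ending in (HHH, T): 1/48
gamma_H / (gamma_H + gamma_T)         # 17/18 = 0.9444444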
Prediction
Let now \(\pi\) denote a vector where the first component is \({\mathbb{P}}(N = \text{Tails}|H_{1:3})\) and the second, \({\mathbb{P}}(N = \text{Heads}|H_{1:3})\).
- If you see “Heads, Heads, Heads” updated beliefs: \(\pi = (1/18, 17/18)\)
- Interpretation: “I predict another Heads with 94% confidence”
- Exercise: if you see “Heads, Tails, Tails” or any other mix of the two: \(\pi = (1/2, 1/2)\)
- Interpretation: “I do not lean towards either prediction” (if forced, flip your own coin!)
- By symmetry: if you see “Tails, Tails, Tails”: \(\pi = (17/18, 1/18)\)
- Interpretation: “I predict another Tails with 94% confidence”
Prediction: “bad approach”
In the prediction exercise, some of you may have used the following argument:
- Compute a point estimate on \(p\).
- Then look at the probability of H and T for a Bernoulli with the point estimate plugged in
Warning: in general this could lead to a suboptimal way to update beliefs.
But: in this specific example, if you take the point estimate to be the posterior mean, you will find the same answer. Do not be fooled: this is the exception, not the rule! It works here only because the Bernoulli has a single parameter, its mean, and the predictive probability is linear in that mean. Try it on other models next week and you will see the plug-in method gives a different, sub-optimal answer.
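To see why the posterior mean happens to work here: the next flip is conditionally independent of the observed flips given \(p\), so by the tower property,
\[\begin{align}
{\mathbb{P}}(N = \text{H} | H_{1:3}) = {\mathbf{E}}[{\mathbb{P}}(N = \text{H} | p) | H_{1:3}] = {\mathbf{E}}[p | H_{1:3}],
\end{align}\]
i.e. the exact predictive probability is linear in \(p\), so plugging the posterior mean into the Bernoulli gives the same answer. For quantities that are non-linear in the parameter, the plug-in and fully Bayesian answers differ.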
Calibrated prediction uncertainty
Question: In what sense is a belief update “optimal”?
Consider the following experiment, based on repeated synthetic data generation followed by prediction evaluation:
set.seed(1)
library(dplyr)   # for %>% and mutate
library(ggplot2) # for plotting

rbern <- function(n, prob) { return(rbinom(n, 1, prob)) }
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max

# We will keep track, for each synthetic dataset, of whether we successfully
# predicted (we know the truth because we are using synthetic data)
success <- c()
nIters <- 10000
for (iteration in seq_len(nIters)) {
  # Data generation
  i <- rdunif(3) - 1           # uniform on {0, 1, 2}
  x <- rbern(3, i/2.0)         # three flips of coin i
  nextThrow <- rbern(1, i/2.0) # the flip we are trying to predict
  nHeads <- sum(x)
  # Record prediction statistics for the case of interest
  if (nHeads == 3) {
    pred <- 1 # after 3 heads, our method predicts "heads again"
    success <- c(success, (if (pred == nextThrow) 1 else 0))
  }
}

# Plot evolution of the success rate across iterations
df <- data.frame("iteration" = c(1:length(success)), success)
df <- df %>%
  mutate(
    successRate = cumsum(success) / seq_along(success)
  )
ggplot(data=df, aes(x=iteration, y=successRate)) +
  geom_hline(yintercept=17/18, linetype="dashed", color = "black") +
  geom_line()

Calibration
- This property, being right 94% of the time when you say you are “94% sure”, is called calibration
- For Bayesian methods, calibration is always satisfied if the model is true
- Things are more complicated if the model is not true (“M-open”), more on that in the exercise.
Exercise 1, question 3:
- Back to the problem of estimation of \(p\), let us consider first a uniform prior with \(K = 1000\) say. Generate 1000 datasets each of size 10 from the model and use your posterior and HDI function at 90% to perform Bayesian inference. Is the HDI calibrated? What does it mean exactly in the context of a well-specified Bayesian model? (cf a frequentist model)
- Now generate the data from a different prior, say a discretized normal distribution with a mean of 0.2 and standard deviation of 0.2. How does this affect calibration? Do the same for a case where each dataset has size 100 instead. Then when each has size 1000.
Exercise 1, optional question: apply your posterior inference method to real data. For example, the rocket launch data.
Exercise 1, optional question: a randomized algorithm can be viewed as a function that takes as input a special object which provides access to randomness. Let us assume this object only provides one function, called “bernoulli(p)”, to access iid Bernoullis. Write a function that takes an arbitrary randomized algorithm as input and prints out the decision tree.
Accuracy of point estimate as a function of number of observations?
- You are asked for one more significant digit in your point estimate,
- how many additional observations should you collect?
- Let’s use some code based on synthetic datasets (similar idea to calibration experiment from last time)
set.seed(1)
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max
K = 1000

# Later, we will use a different prior to generate data versus do inference,
# but for now, use the same (uniform on the K+1 grid points)
prior_used_for_computing_posterior <- rep(1/(K+1), K+1)

simulate_posterior_mean_error <- function(n_datapoints) {
  i <- rdunif(K + 1) - 1               # always generate the data using a uniform prior
  true_p <- i/K
  x <- rbinom(1, n_datapoints, true_p) # recall: a Binomial is just a sum of iid Bernoullis
  post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
  post_mean <- posterior_mean(post)
  return(abs(post_mean - true_p))      # error: compare posterior mean to the p used to generate the data
}
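Here posterior_distribution and posterior_mean are assumed to come from your Exercise 1 solution. A minimal sketch of what they could look like on the grid of \(K+1\) points (my sketch, not necessarily the official solution):
grid <- seq(0, 1, by = 1/K) # the K+1 grid points 0, 1/K, ..., 1
posterior_distribution <- function(prior, n_heads, n_flips) {
  unnormalized <- prior * dbinom(n_heads, n_flips, grid) # prior times Binomial likelihood at each grid point
  return(unnormalized / sum(unnormalized))               # renormalize
}
posterior_mean <- function(posterior) {
  return(sum(grid * posterior)) # expectation of p under the posterior PMF
}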
Accuracy of point estimate: empirical results
- Calling our new function to get errors for:
- 1000 datasets each of size 10 (= 10^1)
- 1000 datasets each of size 32 (= 10^(3/2))
- …
set.seed(1)
df <- data.frame("n_observations" = rep(c(10, 32, 100, 316, 1000), each=1000))
df$error_well_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
ggplot(data=df, aes(x=n_observations, y=error_well_specified+1e-5)) +
  stat_summary(fun.y = mean, geom="line") + # line averages over the 1000 datasets
  scale_x_log10() +                         # show result in log-log scale
  scale_y_log10() +
  geom_point()

- Good news: error goes to zero
- This property is called consistency
- Viewed as a bare minimum requirement for statistical procedures
- But here we are interested in the rate (how fast it goes to zero?)
Result suggests a linear fit in the log-log scale \(\underbrace{\log_{10}(\text{error})}_{y} = a\; \underbrace{\log_{10}(\text{number of observation})}_{x} + b\)
Eyeball the coefficient \(a = \Delta y / \Delta x\)
Accuracy of point estimate as a function of number of observations
\(\underbrace{\log_{10}(\text{error})}_{y} = a \underbrace{\log_{10}(\text{number of observation})}_{x} + b\)
Eyeballing the plot gives \(a \approx -1/2\), i.e. \(\text{error} = O((\text{number of observations})^{-1/2})\)
Similarly to the frequentist setting, for many models, root mean square error will scale like \[\frac{\text{constant}}{\sqrt{\text{number of observations}}}\]
- So if we want to reduce it by factor 10 (i.e. “gain one significant digit”), need 100x more observations
To gain 2 more significant digits: need 10,000x more observations, etc
What we did here is an example of an (empirical) asymptotic analysis
What if we used the “wrong” prior
- Recall the non-uniform prior from the exercise question:
prior_used_for_computing_posterior <- dnorm(seq(0, 1, by=1/K), mean = 0.2, sd = 0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
plot(seq(0, 1, by=1/K), prior_used_for_computing_posterior)

- The code below will add another curve to the error plot
- In this new curve, the posterior is computed based on the wrong (non-uniform) prior
- But we still simulate the “true” \(p\) uniformly, hence the misspecification
set.seed(2)
library(tidyr) # for gather

df$error_mis_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
df <- gather(data=df, key=prior_type, value=error, error_mis_specified, error_well_specified)
ggplot(data=df, aes(x=n_observations, y=error+1e-5, colour=factor(prior_type))) +
  stat_summary(fun.y = mean, geom="line") +
  scale_x_log10() +
  scale_y_log10() +
  geom_point()

- The misspecified model exhibits slightly lower performance when there are 10 observations,
- but the performance of the two models becomes very close as the number of observations increases
- Do you think this will always be the case for any misspecified model?
First way to break consistency: violating Cromwell’s rule
- Suppose we put zero prior mass on having a fair coin. What happens to the posterior?

If we put prior mass of zero on some event, say \(\pi_0(A) = 0\), then the posterior probability of that event will always be zero, \(\pi_n(A) = 0\), no matter how many observations \(n\) we get
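A tiny numerical illustration with the coin bucket (\(K = 2\); my own example): put zero prior mass on the fair coin and observe \(n\) heads in a row.
p <- c(0, 1/2, 1)          # the three coin biases
prior <- c(1/2, 0, 1/2)    # Cromwell violation: zero mass on the fair coin
n <- 10                    # observe n = 10 heads in a row
posterior <- prior * p^n   # prior times likelihood
posterior / sum(posterior) # (0, 0, 1): the fair coin stays at zero for every n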
Cromwell’s rule (Oliver Cromwell, 1650): “I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”
Lecture ended here
Second way to break consistency: Unidentifiable models
When there are too many parameters, you may not be able to recover them even if you had “infinite data”
- Example:
- Let us modify the prior for the rocket example (continuous version), setting \(p = p_1 p_2\) where \(p_1 \sim {\text{Unif}}(0,1)\) and \(p_2 \sim {\text{Unif}}(0,1)\).
- Intuition: distinct pairs \((p_1, p_2)\) can yield exactly the same likelihood, e.g. \((p_1, p_2) = (0.1, 0.2)\) and \((p_1, p_2) = (0.2, 0.1)\) both give \(p = p_1 p_2 = 0.02\).
Model code in the Blang PPL
model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}

- Observations:
- The posterior does not become degenerate as the sample size increases
- It is now asymmetric. This is not a bug! Can you figure out why?
- This is an example of what is called an unidentifiable model: there exist two distinct parameters, \(\theta_1\) and \(\theta_2\), such that \(\theta_1 \neq \theta_2\) yet the likelihoods are exactly equal, \({\mathbb{P}}_{\theta_1} = {\mathbb{P}}_{\theta_2}\).
Theoretical justification of consistency: Doob’s consistency theorem
In the exchangeable setup, outside of these two failure modes, we are guaranteed to have consistency. The details of the statement of the theorem are somewhat delicate and need to be examined carefully to interpret the result correctly.
- Frequentist notions of consistency under \(\theta\):
- setup:
- a parametric model, \(\{{\mathbb{P}}_\theta:\theta \in \Theta\}\), defined on space \(\Omega\)
- observed random variables \(X_i : \Omega \to {\mathbf{R}}^d\)
- estimator \(T_n(x_1, x_2, \dots, x_n)\) for \(\theta\)
- consistency:
- compose the observation r.v. with the estimator: \(T_n = T_n(X_1, X_2, \dots, X_n)\)
- ask that \(T_n \to \theta\) as \(n \to \infty\), under a suitable notion of convergence of random variables…
- either “in \({\mathbb{P}}_\theta\)-probability” or “\({\mathbb{P}}_\theta\) almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all \(\epsilon > 0\), \({\mathbb{P}}_\theta(d(T_n, \theta) < \epsilon) \to 1\) for some metric \(d\)
- almost sure: \({\mathbb{P}}_\theta(T_n \to \theta) = 1\)
- frequentists ask that the above holds for all \(\theta\)
- Bayesian notion of consistency under \(\theta\)
- setup:
- assume the “exchangeable setup”

- let \(\Pi_n(\cdot) = \Pi_n(\cdot | X_1, X_2, \dots, X_n)\) denote the (random) posterior (why random: as for the frequentist case, composition of the observations random variable with the posterior update map)
- ask that \(\Pi_n(A) \to \delta_\theta(A)\) for any \(A\), under a suitable notion of convergence of random variables…
- either “in \({\mathbb{P}}_\theta\)-probability” or “\({\mathbb{P}}_\theta\) almost sure” (latter AKA a.s. or strong consistency)
- in probability: for all \(\epsilon > 0\), \({\mathbb{P}}_\theta(|\Pi_n(A) - \delta_\theta(A)| < \epsilon) \to 1\)
- almost sure: \({\mathbb{P}}_\theta(\Pi_n(A) \to \delta_\theta(A)) = 1\)
- Bayesians ask that the above holds for a set of “good” \(\theta\)’s, denoted \(G\), such that this good set has probability one under the prior: \(\pi_0(G) = 1\).
Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.
Corollary: if you use the wrong prior, \(\tilde \pi_0\), but \(\tilde \pi_0(A) > 0 \Longleftrightarrow \pi_0(A) > 0\), then Doob’s consistency still holds for identifiable exchangeable models.
Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, anything derived from it will be right too)
Continuous random variables: motivation
- Example: \(p \in [0, 1]\) instead of \(p \in \{0, 1/K, 2/K, \dots, 1\}\)
- Motivations for having continuous random variables:
- Discretization becomes exponentially expensive as the number of variables grows
- We want Bernstein-von Mises to kick in (soon)
- Annoying to select a discretization size (\(K\) in our rocket example)
Continuous random variables: first implementation
- Question: how to store the posterior distribution of a continuous random variable in a computer?
- With a discrete random variable: we could do it with an array of size \(K+1\)
- For continuous random variables: use a density (first answer)
Recall: how to use a density to compute a probability?
- Say a random variable \(X\) has density \(f\)
- I want to know the probability that \(X\) is between say 0.1 and 0.3
- Answer:

\[{\mathbb{P}}(X \in [0.1, 0.3]) = \int_{0.1}^{0.3} f(x) {\text{d}}x\]
- At this point, you should have tears in your eyes: integrals are nasty
- No closed form in general
- Even if we pick a nice prior with closed-form integrals, posterior integrals might be nasty
- Recall: in the rocket example, we observed that even if we start with a uniform prior, the posterior is not uniform
- If \(X = (X_1, X_2, \dots, X_d)\) is a random vector (stacked random variables), we need \(d\) iterated integrals
- Numerical approximation (trapezoid, etc) does not scale to models having many dependent random variables
Monte Carlo as a black box
Motivation:
- we want continuous random variables without having to compute integrals (no closed form; >10 dimensions is too slow for numerical integration)
- we want combinatorial objects without hard sums
Idea:
- Store a set \(S\) of random points called Monte Carlo samples,
- \(S = \{X_1, X_2, \dots, X_M\}\),
- chosen such that probabilities can be approximated by:
\[{\mathbb{P}}(X \in [0.1, 0.3]) \approx \frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M}\]
Intuition. Let’s say you want to compute the area of this weird curved shape \(A\):

Calculus: need the following double integral over the weird shape \(A\) \[\int \int_A {\text{d}}x {\text{d}}y\]
- Monte Carlo method:
- Simulate random points in the square \(S = \{X_1, X_2, \dots, X_M\}\)
- Return the fraction of the points that fall inside the weird shape \(A\)
Note: all we need to do for each point \(X_i\) is to check if the point falls in the region or not. This is called pointwise evaluation. No need to compute complicated integrals.
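A quick sketch of this recipe on a shape where we know the answer (a quarter disk, my own choice rather than the shape in the figure):
set.seed(1)
M <- 100000
x <- runif(M); y <- runif(M) # M random points in the unit square
inside <- (x^2 + y^2 <= 1)   # pointwise evaluation: does the point fall in the shape?
mean(inside)                 # approx. pi/4 = 0.7853982; no integral computed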
How to create Monte Carlo samples \(S\)
Guiding principle: we have two ways to compute probabilities: densities and Monte Carlo. We want that, as the size of \(S\) increases, the discrepancy between the two methods becomes arbitrarily close to zero (see the check below).
- Mathematically: we want as \(M \to \infty\), \[\frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M} \to \int_{0.1}^{0.3} f(x) {\text{d}}x\] with “overwhelming probability”.
- This property is called a “Law of Large Numbers”
- There are several ways to get a “Law of Large Numbers” (a fact not emphasized enough!)
- Independent and identically distributed random variables \(X_i\) (the case you may be familiar with, but not the only one!)
- The list of states visited by a Markov chain whose long-run behaviour is characterized by the density \(f\) (called the stationary density)
- But for now you can still treat \(S\) as magically provided by a PPL black box
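To see the guiding principle in action, a check in the iid case, using a Beta(4, 1) density (my choice; it happens to be the posterior of \(p\) after observing HHH under the \({\text{Unif}}(0, 1)\) prior):
set.seed(1)
S <- rbeta(100000, 4, 1)            # M = 100,000 iid samples with density f(p) = 4p^3
mean(S >= 0.1 & S <= 0.3)           # Monte Carlo estimate of P(X in [0.1, 0.3])
pbeta(0.3, 4, 1) - pbeta(0.1, 4, 1) # exact integral: 0.3^4 - 0.1^4 = 0.008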
Recap: computing a mean

\[{\mathbf{E}}[X] = \int x f(x) {\text{d}}x\]
- Monte Carlo way (much easier to implement):
\[{\mathbf{E}}[X] \approx \frac{\sum_{i=1}^M X_i}{M}\]
Again, Laws of Large Numbers link the two
In our weird shape example, separately average the positions for each dimension over the points that fall in the shape.
For the variance, use \(\text{Var}[X] = {\mathbf{E}}[X^2] - ({\mathbf{E}}[X])^2\); we already have the second term, and for the first term
\[{\mathbf{E}}[X^2] \approx \frac{\sum_{i=1}^M X^2_i}{M}\]
Makes sense: think about it as defining a new random variable \(Y = X^2\) and computing the mean of that transformed random variable…
Generalizing this trick: if I give you a nice enough function \(g\) such as \(g(x) = x^2\),
\[{\mathbf{E}}[g(X)] \approx \frac{\sum_{i=1}^M g(X_i)}{M}\]
- Again, all we need is “pointwise evaluation”, even if \(g\) is complicated.
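Continuing with the same Beta(4, 1) samples as a stand-in for the PPL black box, the mean, the variance, and the general \(g\) trick each take one line:
set.seed(1)
S <- rbeta(100000, 4, 1)
mean(S)                  # E[X]; exactly 4/5 here
mean(S^2) - mean(S)^2    # Var[X] via E[X^2] - (E[X])^2; exactly 2/75 here
g <- function(x) exp(-x) # any "nice enough" g, evaluated pointwise
mean(g(S))               # E[g(X)]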
More asymptotic regimes
We have done two thought experiments so far:
- Letting the number of observations (rocket launches) go to infinity, \(N \to \infty\)
- Data collection budget
- Often no control on that
- Letting the number of Monte Carlo samples go to infinity, \(M \to \infty\)
- Computational budget
- If not happy with results, easy to increase

Note:
- for “nice” models such as this one, as \(N \to \infty\) the posterior gets arbitrarily peaky (“converges to a point mass”).
- Conditions will be stated next week.
- But there are less nice models where we do not get convergence to a point mass as the data size goes to infinity.
- For example, this is the case for unidentifiable models:
