DSCI 553 - Statistical Inference II: Bayesian Analysis
Alexandre Bouchard-Côté
2/7/2019
Goal today
Illustrate different issues and techniques that arise in Bayesian analysis using our rocket/coin flip model.
- Credible intervals
- Prediction
- Calibration
- Graphical models
Again, we will cover more advanced examples next week using PPLs.
Bayesian statistics: short historical overview
- Precursors:
- Thomas Bayes (1702–1761); first special case of Bayes rule, published posthumously in 1763
- Pierre-Simon Laplace (1749–1827): generalizations, more applications
- Idea temporarily buried by frequentists in the 1920’s
- Second wave: theoretical foundations
- Bruno de Finetti, 1930: partial justification for exchangeable data
- Stein paradox and crisis in frequentist statistics (1955)
- Objective-Subjective Bayes ‘divide’
- Popularization of Subjective Bayes by Leonard Savage in the ’50s
- Inception of the Objective Bayes school: Harold Jeffreys (1939), further development by José-Miguel Bernardo (1979)
- Third wave: coming of age as a versatile data analysis tool
- Inception: Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller and Edward Teller (1953), Metropolis-Hastings algorithm in the physics literature
- Introduction of MCMC to Bayesian statistics: Stuart Geman and Donald Geman (1984), Gibbs sampling
- First popular PPL: WinBUGS, by David Lunn, Andrew Thomas, Nicky Best, David Spiegelhalter (2000)
Recall: rocket problem
- Would you rather get strapped to…
- “shiny rocket”: 1 success, 0 failures
- “rugged rocket”: 98 successes, 2 failures

Recall: uncertainty estimates
- Take-home message:
- Point estimates are often insufficient, and can be very dangerous
- We want some measure of uncertainty
- Bayesian inference provides one way to build uncertainty measures
- Bayesian measures of uncertainty we will describe: credible intervals
Recall: Bayesian inference, high-level picture
- Construct a probability model including
- random variables for what we will measure/observe
- random variables for the unknown quantities (or latent)
- those we are interested in (“parameters”, “predictions”)
- others that just help us formulate the problem (“nuisance”, “random effects”). Ignore this for now.
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to:
- make prediction (point estimate)
- estimate uncertainty (credible intervals)
- make a decision (more on this later)
Recall: intuition behind conditioning
- The workhorse of Bayesian inference!
- Used to define models (what we will do now)
- And soon, to get point estimates and credible intervals

- Important: we want the result of this procedure, \({\mathbb{P}}(\cdot | E)\) to be a probability when viewed as a function of the first argument for any fixed \(E\).
Recall: definition of conditional probability \({\mathbb{P}}(A | E)\)

- For a query event \(A\), what should the updated probability be?
- Optimal choice: \[{\mathbb{P}}(A | E) = \frac{{\mathbb{P}}(A \cap E)}{{\mathbb{P}}(E)}\]
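We can check the definition mechanically on a toy example. A minimal Python sketch (the die and the events here are illustrative assumptions, not from the slides):

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die,
# E = "the roll is even", A = "the roll is at least 4".
P = {s: Fraction(1, 6) for s in range(1, 7)}  # uniform probability on outcomes

E = {s for s in P if s % 2 == 0}   # {2, 4, 6}
A = {s for s in P if s >= 4}       # {4, 5, 6}

p_E = sum(P[s] for s in E)                 # P(E) = 1/2
p_A_and_E = sum(P[s] for s in A & E)       # P({4, 6}) = 1/3
p_A_given_E = p_A_and_E / p_E
print(p_A_given_E)  # 2/3
```

Note that for fixed \(E\), the resulting \({\mathbb{P}}(\cdot | E)\) is again a probability, as required.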
Recall: “constructing a probability model”
- Step 1: Pick one of the \(K+1\) coins from the bucket
- \(\leftrightarrow\) Design a new rocket
- Step 2: Repeatedly flip the same coin
- \(\leftrightarrow\) Build a bunch of copies of the rocket, launch them and enjoy the show
Better notation
Using our slick probabilist’s notation introduced at the end of last lecture, our model is:
\[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\}
\end{align}\]

Recall:
- Interpretation of \(I\): index of the coin I pick, \(X_m\), the \(m\)-th flip
- The notation \(X_m | I \sim {\text{Bern}}(I/K)\) means:
- The conditional PMF of the random variable \(X_m\) given \(I\) is \({\text{Bern}}(I/K)\)
- I.e. for any \(x \in {\mathbf{R}}\) and \(i \in \{0, 1, \dots, K\}\), the following conditional probability is obtained by looking up the picture on the right
- Or, mathematically
\[\begin{align}{\mathbb{P}}(X_m = x | I = i) =
\begin{cases}
i/K, & \text{if } x = 1 \\
1 - i/K, & \text{if } x = 0 \\
0, & \text{otherwise}
\end{cases}
\end{align}\]
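The case analysis above translates directly into code. A sketch in Python (using the \(K+1 = 3\) coin bucket from the slides):

```python
from fractions import Fraction

K = 2  # the K + 1 = 3 coin bucket

def cond_pmf(x, i, K=K):
    """P(X_m = x | I = i): the Bern(i/K) PMF from the case analysis above."""
    p = Fraction(i, K)
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return Fraction(0)  # any other x has probability zero

print(cond_pmf(1, 1))  # 1/2, the standard coin
```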
Recall: warm-up calculation
- Model
- Step 1: Pick one of the \(K+1 = 3\) coins from the bucket
- Step 2: Repeatedly flip the same coin
- Query
- If the first \(n=3\) flips all return heads, what is the posterior probability that the standard (\(p = 1/2\)) coin was picked?

Recall: approach for warm-up calculation
- Strategy (Bayes’ rule):
- Attack the slightly more general problem \(\pi(i) = {\mathbb{P}}(C_i | H_1, H_2, H_3)\)
- By definition of conditioning \[\pi(i) = \frac{{\mathbb{P}}(C_i, H_1, H_2, H_3)}{{\mathbb{P}}(H_1, H_2, H_3)}\] let us call the numerator \(\gamma(i)\) and the denominator, \(Z\), i.e. \(\pi(i) = \frac{\gamma(i)}{Z}\)
- Start by computing \(\gamma(0), \gamma(1), \gamma(2)\)
- Use chain rule (multiply edge conditional probabilities)
- Normalize: compute \(Z = \gamma(0) + \gamma(1) + \gamma(2)\) and divide the vector \(\gamma\) by \(Z\):
- Mathematical expression for the above algorithm (Bayes’ rule): if \(i \in \{0, 1, \dots, K\}\) indexes a partition \(C_i\) of \(S\), then the posterior distribution, \(\pi(i) = {\mathbb{P}}(C_i | D)\), is proportional to the joint distribution \(\gamma(i) = {\mathbb{P}}(C_i, D)\) in the sense that
\[
\pi(i) = \frac{\gamma(i)}{\sum_{i=0}^K \gamma(i)}
\]
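The whole algorithm (chain rule along each path, then normalize) fits in a few lines. A Python sketch for the \(K+1 = 3\) coin bucket after observing three heads:

```python
from fractions import Fraction

K = 2                              # the K + 1 = 3 coin bucket
prior = Fraction(1, K + 1)         # P(C_i) = 1/3 for each coin

# Chain rule along each path: gamma(i) = P(C_i) * (i/K)^3
gamma = [prior * Fraction(i, K) ** 3 for i in range(K + 1)]

# Normalize: pi(i) = gamma(i) / Z
Z = sum(gamma)
pi = [g / Z for g in gamma]
print(pi)  # [Fraction(0, 1), Fraction(1, 9), Fraction(8, 9)]
```

This recovers the belief update \(\pi = (0, 1/9, 8/9)\) from the first lecture.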
Resulting belief updates

- First lecture: if you see “Heads, Heads, Heads” updated beliefs: \(\pi = (0, 1/9, 8/9)\)
- Exercise: if you see “Heads, Tails, Tails” or any other mix of the two: \(\pi = (0, 1, 0)\)
- By symmetry: if you see “Tails, Tails, Tails”: \(\pi = (8/9, 1/9, 0)\)
More on decision trees: drawing your own

Recall we computed each \(\gamma(i)\) by multiplying conditional probabilities on the edges of a path in a decision tree. I gave you the decision tree; how do you draw your own decision tree from scratch?
Idea:
- write down the model as we did before \[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\}
\end{align}\]
- Each chunk of the form \(Y | Z\) tells you that you should make the decisions for random variable \(Y\) after those for random variable \(Z\).
Rationale: mathematically, chain rule works in any order (\({\mathbb{P}}(A, B) = {\mathbb{P}}(A){\mathbb{P}}(B|A) = {\mathbb{P}}(B){\mathbb{P}}(A|B)\)), but here we want to pick an order such that all the conditional distributions are known (part of the model specification)
Example: for the coin bucket model, this tells us we should draw \(I\) first, then the \(X_m\)’s (and that the order among the \(X\)’s does not matter).
Large decision trees: graphical models

Question: how to do this for a large model? For example: \[\begin{align*}
X_1 &\sim {\text{Unif}}(0, 1/2, 1) \\
X_2 | X_1 &\sim {\text{Bern}}(X_1) \\
X_3 | X_1, X_4 &\sim {\text{Unif}}\{1, X_1, X_4\} \\
X_4 &\sim {\text{Bern}}(1/3) \\
X_5 | X_3 &\sim {\text{Unif}}\{0, .., X_3\} \\
X_7 &\sim {\text{Bern}}(1/4) \\
X_6 | X_4, X_7 &\sim {\text{Unif}}\{X_4, X_7\} \\
X_8 &\sim {\text{Bern}}(1/4)
\end{align*}\]
Idea:
- Build a graph, where nodes are random variables and edges are dependencies coming from the given conditional distributions. If you see \(X | Y\), add an edge \(Y \to X\).
- Use an ordering of the nodes that respects the directionality of all the edges.
Terminology: this graph is called a (directed) graphical model or Bayes net
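Finding an ordering that respects all edge directions is a topological sort, which the Python standard library provides. A sketch for the eight-variable model above (variable names as in the model; the parent sets are read off the \(Y | Z\) chunks):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Parent sets read off the "Y | Z" chunks of the model above
parents = {
    "X1": [], "X2": ["X1"], "X3": ["X1", "X4"], "X4": [],
    "X5": ["X3"], "X6": ["X4", "X7"], "X7": [], "X8": [],
}

# static_order() lists every node after all of its parents
order = list(TopologicalSorter(parents).static_order())
print(order)  # one valid sampling order; parents always come before children
```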
Generating synthetic datasets
- Graphical models also provide a way to generate synthetic datasets.
- Simply follow the same ordering as discussed in the previous slide
- but this time sample realizations of each (conditional) distribution.
- Much cheaper than decision trees! Linear complexity instead of exponential.
- Useful for:
- benchmarking, since the synthetic dataset created this way contains both the observations and the “unknowns”
- defining important concepts such as calibration and optimality (soon)
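This forward-sampling scheme (often called ancestral sampling) can be sketched for the coin-bucket model; `sample_dataset` is an illustrative helper name:

```python
import random

random.seed(1)
K = 2  # the K + 1 = 3 coin bucket

def sample_dataset(n_flips=3):
    """Ancestral sampling: draw I first, then each flip given I."""
    i = random.randrange(K + 1)                                     # I ~ Unif{0, ..., K}
    flips = [int(random.random() < i / K) for _ in range(n_flips)]  # X_m | I ~ Bern(i/K)
    return i, flips

i, flips = sample_dataset()  # both the "unknown" i and the observations are returned
```

No path enumeration is needed: the cost is linear in the number of variables.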
Recall: point estimate (summaries) for PMFs and posterior PMF
- Posterior PMF of a random variable \(X\) given observation \(D\) defined as you would expect, \(\pi(x) = {\mathbb{P}}(X = x | D)\)
- Usual suspects for providing point estimate summaries of a PMF \(p(x)\)
- Mode: a global optimum of the PMF, \(\text{argmax}_x p(x)\)
- In the context of posterior distribution, called the maximum a posteriori (MAP)
- Mean: \(\sum_{x} x p(x)\)
- In the context of posterior distribution, called the posterior mean
Point estimate: example
- Random variable \(X\) with point masses at \(3, 4, 5, 10\) with respective probabilities \(1/5, 2/5, 1/5, 1/5\)
- Mode at \(4\)
- Mean is \((1/5)*3 + (2/5)*4 + (1/5)*5 + (1/5)*10 = 5.2\) denoted \({\mathbf{E}}[X] = 5.2\)
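Both summaries are one-liners once the PMF is stored as a dictionary. A quick check of the numbers above in Python:

```python
from fractions import Fraction

# The PMF from this example
pmf = {3: Fraction(1, 5), 4: Fraction(2, 5), 5: Fraction(1, 5), 10: Fraction(1, 5)}

mode = max(pmf, key=pmf.get)               # argmax_x p(x)
mean = sum(x * p for x, p in pmf.items())  # sum_x x p(x)
print(mode, float(mean))  # 4 5.2
```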

Point estimate: example
- Coin bucket/rocket: posterior distribution on \(p\) given that 3 heads were observed:
- Posterior mean: \(0 + (1/2) (1/9) + (1) (8/9) = 17/18\)
- MAP: \(1\)

Measures of uncertainty
- You are given a level, typically \(95\%\) or \(90\%\), denoted \(1-\alpha = 0.95\) or \(1-\alpha = 0.9\)
- Ideally, we would want to find a set \(T\) such that \(\sum_{x \in T} p(x) = 1-\alpha\)
- For discrete distributions, there may not be a solution, but let us ignore that issue for now (suppose each probability is very small and replace \(=\) by \(\approx\) up to an error bounded by the largest individual probability)
- For continuous distributions that will not be an issue
- There is still a problem: there are many solutions! Some of them might appear weird, e.g. include points of probability zero!
- Example: both sets shown in red contain \(75\%\) of the mass

High Density Intervals (HDI)
- High Density “Intervals”: pick a solution with the least number of points \(|T|\)
- To handle discrete models: \(\text{argmin}\{ |T| : \sum_{x \in T} p(x) \ge 1-\alpha\}\).
- Still not unique in some corner cases, but good enough
- “Interval” in name is unfortunate as it is not actually always an interval
- Example: multimodal distribution
- Sometimes called high density region
Example: the left one below is the HDI because the number of points, \(|T|\) is smaller (\(5 < 9\))
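The argmin above can be computed greedily: add points in order of decreasing probability until the accumulated mass reaches \(1-\alpha\). A Python sketch (`hdi` is an illustrative helper name; the PMF reuses the point-estimate example):

```python
from fractions import Fraction

def hdi(pmf, level):
    """Greedy HDI: add points by decreasing probability until mass >= level."""
    T, mass = set(), Fraction(0)
    for x, p in sorted(pmf.items(), key=lambda kv: kv[1], reverse=True):
        T.add(x)
        mass += p
        if mass >= level:
            return T, mass

pmf = {3: Fraction(1, 5), 4: Fraction(2, 5), 5: Fraction(1, 5), 10: Fraction(1, 5)}
T, mass = hdi(pmf, Fraction(3, 5))  # ties mean the minimizer is not unique
```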

Prediction

Today’s exercise: given you observe the same data (3 Heads in a row), what is your belief that the next draw is again a Heads?
\[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\}
\end{align}\]
Prediction: using posterior probability as the updated belief

Good approach:
- Add a random variable for the quantity you want to predict (next flip \(N\)).
- Add one level in the decision tree.
- Use Bayes rule.
- Twist: distinct paths lead to the same prediction
- Sum the probabilities of the paths you collapse (why can we do this?).
- Two ways to get \((H_{1:3}, N = \text{H again})\): \(\gamma(1) = 1/48 + 1/3\)
- Only one way to get \((H_{1:3}, N = \text{T})\): \(\gamma(0) = 1/48\)
- As before, normalize \(\gamma\) to get \(\pi\): \[{\mathbb{P}}(N = \text{H again}|H_{1:3}) = \frac{\gamma(1)}{\gamma(0) + \gamma(1)} = 17/18\]
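The path-collapsing computation above can be checked numerically. A Python sketch for the \(K+1 = 3\) coin bucket:

```python
from fractions import Fraction

K = 2
prior = Fraction(1, K + 1)
p = [Fraction(i, K) for i in range(K + 1)]   # heads probability of each coin

# Collapse paths: sum over the coin index for each value of N
gamma_H = sum(prior * p[i] ** 4 for i in range(K + 1))               # H1:3 then N = H
gamma_T = sum(prior * p[i] ** 3 * (1 - p[i]) for i in range(K + 1))  # H1:3 then N = T

p_next_heads = gamma_H / (gamma_H + gamma_T)
print(p_next_heads)  # 17/18
```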
Prediction example: all cases
Let \(\pi\) now denote a vector where the first component is \({\mathbb{P}}(N = \text{Tails}|H_{1:3})\) and the second, \({\mathbb{P}}(N = \text{Heads}|H_{1:3})\).
- If you see “Heads, Heads, Heads” updated beliefs: \(\pi = (1/18, 17/18)\)
- Interpretation: “I predict another Heads with 94% confidence”
- Exercise: if you see “Heads, Tails, Tails” or any other mix of the two: \(\pi = (1/2, 1/2)\)
- Interpretation: “I do not lean toward either prediction” (if forced, flip your own coin!)
- By symmetry: if you see “Tails, Tails, Tails”: \(\pi = (17/18, 1/18)\)
- Interpretation: “I predict another Tails with 94% confidence”
Law of total probability
The process of adding the probabilities of the paths is called the law of total probability, mathematically, if \(E_i\) is a partition of the sample space, then for any event \(A\), \[{\mathbb{P}}(A) = \sum_i {\mathbb{P}}(A, E_i)\]
Why it matters: this provides a principled way to handle many data analysis issues such as missing data, prediction, censoring, etc. More examples of that later.
Informally: “marginalize” over all the uncertain quantities. In particular, in the Bayesian framework, during prediction, the parameters can be better seen as “nuisance” parameters.
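As a small sanity check of the law of total probability on the coin bucket: marginalizing over the coin choice gives the probability of heads on a single flip.

```python
from fractions import Fraction

K = 2
prior = Fraction(1, K + 1)

# P(X = H) = sum_i P(C_i, X = H) = sum_i P(C_i) P(X = H | C_i)
p_heads = sum(prior * Fraction(i, K) for i in range(K + 1))
print(p_heads)  # 1/2
```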
Prediction: “bad approach”
In the prediction exercise, some of you may have used the following argument:
- Compute a point estimate on \(p\).
- Then look at the probability of H and T for a Bernoulli with the point estimate plugged-in
Warning: in general this is a suboptimal way to update beliefs.
But: in this specific example, if you take the point estimate to be the posterior mean, you will find the same answer. This is not true in the vast majority of cases! Here this works because the Bernoulli has a single parameter, its mean. Try it for other models next week, you will see the plug-in method gives a different, sub-optimal answer.
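The coincidence, and why it breaks for other point estimates, can be seen directly. A Python sketch using the posterior \((0, 1/9, 8/9)\) from the earlier slide:

```python
from fractions import Fraction

# Posterior over p after HHH, from the earlier slide: (0, 1/9, 8/9)
post = {Fraction(0): Fraction(0), Fraction(1, 2): Fraction(1, 9), Fraction(1): Fraction(8, 9)}

# Full Bayesian answer: marginalize over p (law of total probability)
predictive = sum(w * p for p, w in post.items())       # 17/18

# Plug-in with the posterior mean: Bern(E[p]) assigns P(H) = E[p],
# which is the same sum, hence the coincidence for the Bernoulli
post_mean = sum(p * w for p, w in post.items())        # 17/18

# Plug-in with the MAP instead: Bern(1) assigns P(H) = 1, a different answer
map_plug_in = max(post, key=post.get)
print(predictive, post_mean, map_plug_in)  # 17/18 17/18 1
```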
Calibrated prediction uncertainty
Question: in what sense is a belief update “optimal”?
Consider the following experiment based on repeated synthetic data generation followed by prediction evaluation
library(dplyr)    # for %>% and mutate()
library(ggplot2)  # for the plot below

set.seed(1)
rbern <- function(n, prob) { return(rbinom(n, 1, prob)) }
rdunif <- function(max) { return(ceiling(max * runif(1))) } # unif on 1, 2, ..., max

# We will keep track for each synthetic dataset of
# whether we successfully predicted (we know truth because we are using synthetic)
success <- c()
nIters <- 10000
for (iteration in 1:nIters) {
  # Data generation
  i <- rdunif(3) - 1 # Uniform on {0, 1, 2}
  x <- rbern(3, i/2.0)
  nextThrow <- rbern(1, i/2.0)
  nHeads <- sum(x)
  # Record prediction statistics for the case of interest
  if (nHeads == 3) {
    pred <- 1 # predict Heads (posterior predictive probability 17/18)
    success <- c(success, (if (pred == nextThrow) 1 else 0))
  }
}

# Plot evolution of the statistics across iterations
df <- data.frame("iteration" = c(1:length(success)), success)
df <- df %>%
  mutate(
    successRate = cumsum(success) / seq_along(success)
  )
ggplot(data=df, aes(x=iteration, y=successRate)) +
  geom_hline(yintercept=17/18, linetype="dashed", color = "black") +
  geom_line()

Calibration
- This property, “being right 94% of the time when you say you are 94% sure”, is called calibration
- For Bayesian methods, calibration is always satisfied if the model is true
- Things are more complicated if the model is not true (“M-open”), more on that later
New example: change point problem
Moved to second week