DSCI 553 - Statistical Inference II: Bayesian Analysis
Alexandre Bouchard-Côté
2/7/2019
Goal today
Illustrate different issues and techniques that arise in Bayesian analysis using our rocket/coin flip model.
- Credible intervals
- Prediction
- Calibration
- Graphical models
Again, we will cover more advanced examples next week using PPLs.
Bayesian statistics: short historical overview
- Precursors:
- Thomas Bayes (1702–1761); first special case of Bayes rule, published posthumously in 1763
- Pierre-Simon Laplace (1749–1827): generalizations, more applications
- Idea temporarily buried by frequentists in the 1920’s
- Second wave: theoretical foundations
- Bruno de Finetti, 1930: partial justification for exchangeable data
- Stein paradox and crisis in frequentist statistics (1955)
- Objective-Subjective Bayes ‘divide’
- Popularization of Subjective Bayes by Leonard Savage in the ’50s
- Inception of the Objective Bayes school: Harold Jeffreys (1939), further development by José-Miguel Bernardo (1979)
- Third wave: coming of age as a versatile data analysis tool
- Inception: Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller and Edward Teller (1953), Metropolis-Hastings algorithm in the physics literature
- Introduction of MCMC to Bayesian statistics: Stuart Geman and Donald Geman (1984), Gibbs sampling
- First popular PPL: WinBUGS, by David Lunn, Andrew Thomas, Nicky Best, David Spiegelhalter (2000)
Recall: rocket problem
- Would you rather get strapped to…
- “shiny rocket”: 1 success, 0 failures
- “rugged rocket”: 98 successes, 2 failures

Recall: uncertainty estimates
- Take-home message:
- Point estimates are often insufficient, and can be very dangerous
- We want some measure of uncertainty
- Bayesian inference provides one way to build uncertainty measures
- Bayesian measures of uncertainty we will describe: credible intervals
Recall: Bayesian inference, high-level picture
- Construct a probability model including
- random variables for what we will measure/observe
- random variables for the unknown quantities (or latent)
- those we are interested in (“parameters”, “predictions”)
- others that just help us formulate the problem (“nuisance”, “random effects”). Ignore this for now.
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to:
- make prediction (point estimate)
- estimate uncertainty (credible intervals)
- make a decision (more on this later)
Recall: intuition behind conditioning
- The workhorse of Bayesian inference!
- Used to define models (what we will do now)
- And soon, to get point estimates and credible intervals

- Important: we want the result of this procedure, \({\mathbb{P}}(\cdot | E)\) to be a probability when viewed as a function of the first argument for any fixed \(E\).
Recall: definition of conditional probability \({\mathbb{P}}(A | E)\)

- For a query event \(A\), what should the updated probability be?
- Optimal choice: \[{\mathbb{P}}(A | E) = \frac{{\mathbb{P}}(A \cap E)}{{\mathbb{P}}(E)}\]
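We can check the definition mechanically on a toy example. A minimal Python sketch (the die and the events here are illustrative assumptions, not from the slides):

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die,
# E = "the roll is even", A = "the roll is at least 4".
P = {s: Fraction(1, 6) for s in range(1, 7)}  # uniform probability on outcomes

E = {s for s in P if s % 2 == 0}   # {2, 4, 6}
A = {s for s in P if s >= 4}       # {4, 5, 6}

p_E = sum(P[s] for s in E)                 # P(E) = 1/2
p_A_and_E = sum(P[s] for s in A & E)       # P({4, 6}) = 1/3
p_A_given_E = p_A_and_E / p_E
print(p_A_given_E)  # 2/3
```

Note that for fixed \(E\), the resulting \({\mathbb{P}}(\cdot | E)\) is again a probability, as required.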
Recall: “constructing a probability model”
- Step 1: Pick one of the \(K+1\) coins from the bucket
- \(\leftrightarrow\) Design a new rocket
- Step 2: Repeatedly flip the same coin
- \(\leftrightarrow\) Build a bunch of copies of the rocket, launch them and enjoy the show
Better notation
Using our slick probabilist’s notation introduced at the end of last lecture, our model is:
\[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\}
\end{align}\]

Recall:
- Interpretation of \(I\): index of the coin I pick, \(X_m\), the \(m\)-th flip
- The notation \(X_m | I \sim {\text{Bern}}(I/K)\) means:
- The conditional PMF of the random variable \(X_m\) given \(I\) is \({\text{Bern}}(I/K)\)
- I.e. for any \(x \in {\mathbf{R}}\) and \(i \in \{0, 1, \dots, K\}\), the following conditional probability is obtained by looking up the picture on the right
- Or, mathematically
\[\begin{align}{\mathbb{P}}(X_m = x | I = i) =
\begin{cases}
i/K, & \text{if } x = 1 \\
1 - i/K, & \text{if } x = 0 \\
0, & \text{otherwise}
\end{cases}
\end{align}\]
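The case analysis above translates directly into code. A sketch in Python (using the \(K+1 = 3\) coin bucket from the slides):

```python
from fractions import Fraction

K = 2  # the K + 1 = 3 coin bucket

def cond_pmf(x, i, K=K):
    """P(X_m = x | I = i): the Bern(i/K) PMF from the case analysis above."""
    p = Fraction(i, K)
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return Fraction(0)  # any other x has probability zero

print(cond_pmf(1, 1))  # 1/2, the standard coin
```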
Recall: warm-up calculation
- Model
- Step 1: Pick one of the \(K+1 = 3\) coins from the bucket
- Step 2: Repeatedly flip the same coin
- Query
- If the first \(n=3\) flips all return heads, what is the posterior probability that the standard (\(p = 1/2\)) coin was picked?

Recall: approach for warm-up calculation
- Strategy (Bayes’ rule):
- Attack the slightly more general problem \(\pi(i) = {\mathbb{P}}(C_i | H_1, H_2, H_3)\)
- By definition of conditioning \[\pi(i) = \frac{{\mathbb{P}}(C_i, H_1, H_2, H_3)}{{\mathbb{P}}(H_1, H_2, H_3)}\] let us call the numerator \(\gamma(i)\) and the denominator, \(Z\), i.e. \(\pi(i) = \frac{\gamma(i)}{Z}\)
- Start by computing \(\gamma(0), \gamma(1), \gamma(2)\)
- Use chain rule (multiply edge conditional probabilities)
- Normalize: compute \(Z = \gamma(0) + \gamma(1) + \gamma(2)\) and divide the vector \(\gamma\) by \(Z\):
- Mathematical expression for the above algorithm (Bayes’ rule): if \(i \in \{0, 1, \dots, K\}\) indexes a partition \(C_i\) of \(S\), then the posterior distribution, \(\pi(i) = {\mathbb{P}}(C_i | D)\), is proportional to the joint distribution \(\gamma(i) = {\mathbb{P}}(C_i, D)\) in the sense that
\[
\pi(i) = \frac{\gamma(i)}{\sum_{i=0}^K \gamma(i)}
\]
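The whole algorithm (chain rule along each path, then normalize) fits in a few lines. A Python sketch for the \(K+1 = 3\) coin bucket after observing three heads:

```python
from fractions import Fraction

K = 2                              # the K + 1 = 3 coin bucket
prior = Fraction(1, K + 1)         # P(C_i) = 1/3 for each coin

# Chain rule along each path: gamma(i) = P(C_i) * (i/K)^3
gamma = [prior * Fraction(i, K) ** 3 for i in range(K + 1)]

# Normalize: pi(i) = gamma(i) / Z
Z = sum(gamma)
pi = [g / Z for g in gamma]
print(pi)  # [Fraction(0, 1), Fraction(1, 9), Fraction(8, 9)]
```

This recovers the belief update \(\pi = (0, 1/9, 8/9)\) from the first lecture.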
Resulting belief updates

- First lecture: if you see “Heads, Heads, Heads” updated beliefs: \(\pi = (0, 1/9, 8/9)\)
- Exercise: if you see “Heads, Tails, Tails” or any other mix of the two: \(\pi = (0, 1, 0)\)
- By symmetry: if you see “Tails, Tails, Tails”: \(\pi = (8/9, 1/9, 0)\)
More on decision trees: drawing your own

Recall we computed each \(\gamma(i)\) by multiplying conditional probabilities on the edges of a path in a decision tree. I gave you the decision tree; how do you draw your own decision tree from scratch?
Idea:
- write down the model as we did before \[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\}
\end{align}\]
- Each chunk of the form \(Y | Z\) tells you that you should make the decisions for random variable \(Y\) after those for random variable \(Z\).
Rationale: mathematically, chain rule works in any order (\({\mathbb{P}}(A, B) = {\mathbb{P}}(A){\mathbb{P}}(B|A) = {\mathbb{P}}(B){\mathbb{P}}(A|B)\)), but here we want to pick an order such that all the conditional distributions are known (part of the model specification)
Example: for the coin bucket model, this tells us we should draw \(I\) first, then the \(X_m\)’s (and that the order among the \(X\)’s does not matter).
Large decision trees: graphical models

Question: how to do this for a large model? For example: \[\begin{align*}
X_1 &\sim {\text{Unif}}(0, 1/2, 1) \\
X_2 | X_1 &\sim {\text{Bern}}(X_1) \\
X_3 | X_1, X_4 &\sim {\text{Unif}}\{1, X_1, X_4\} \\
X_4 &\sim {\text{Bern}}(1/3) \\
X_5 | X_3 &\sim {\text{Unif}}\{0, .., X_3\} \\
X_7 &\sim {\text{Bern}}(1/4) \\
X_6 | X_4, X_7 &\sim {\text{Unif}}\{X_4, X_7\} \\
X_8 &\sim {\text{Bern}}(1/4)
\end{align*}\]
Idea:
- Build a graph, where nodes are random variables and edges are dependencies coming from the given conditional distributions. If you see \(X | Y\), add an edge \(Y \to X\).
- Use an ordering of the nodes that respects the directionality of all the edges.
Terminology: this graph is called a (directed) graphical model or Bayes net
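Finding an ordering that respects all edge directions is a topological sort, which the Python standard library provides. A sketch for the eight-variable model above (variable names as in the model; the parent sets are read off the \(Y | Z\) chunks):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Parent sets read off the "Y | Z" chunks of the model above
parents = {
    "X1": [], "X2": ["X1"], "X3": ["X1", "X4"], "X4": [],
    "X5": ["X3"], "X6": ["X4", "X7"], "X7": [], "X8": [],
}

# static_order() lists every node after all of its parents
order = list(TopologicalSorter(parents).static_order())
print(order)  # one valid sampling order; parents always come before children
```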
Generating synthetic datasets
- Graphical models also provide a way to generate synthetic datasets.
- Simply follow the same ordering as discussed in the previous slide
- but this time sample realizations of each (conditional) distribution.
- Much cheaper than decision trees! Linear complexity instead of exponential.
- Useful for:
- benchmarking, since the synthetic dataset created this way contains both the observations and the “unknowns”
- defining important concepts such as calibration and optimality (soon)
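This forward-sampling scheme (often called ancestral sampling) can be sketched for the coin-bucket model; `sample_dataset` is an illustrative helper name:

```python
import random

random.seed(1)
K = 2  # the K + 1 = 3 coin bucket

def sample_dataset(n_flips=3):
    """Ancestral sampling: draw I first, then each flip given I."""
    i = random.randrange(K + 1)                                     # I ~ Unif{0, ..., K}
    flips = [int(random.random() < i / K) for _ in range(n_flips)]  # X_m | I ~ Bern(i/K)
    return i, flips

i, flips = sample_dataset()  # both the "unknown" i and the observations are returned
```

No path enumeration is needed: the cost is linear in the number of variables.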
Recall: point estimate (summaries) for PMFs and posterior PMF
- Posterior PMF of a random variable \(X\) given observation \(D\) defined as you would expect, \(\pi(x) = {\mathbb{P}}(X = x | D)\)
- Usual suspects for providing point estimate summaries of a PMF \(p(x)\)
- Mode: a global optimum of the PMF, \(\text{argmax}_x p(x)\)
- In the context of posterior distribution, called the maximum a posteriori (MAP)
- Mean: \(\sum_{x} x p(x)\)
- In the context of posterior distribution, called the posterior mean
Point estimate: example
- Random variable \(X\) with point masses at \(3, 4, 5, 10\) with respective probabilities \(1/5, 2/5, 1/5, 1/5\)
- Mode at \(4\)
- Mean is \((1/5)*3 + (2/5)*4 + (1/5)*5 + (1/5)*10 = 5.2\) denoted \({\mathbf{E}}[X] = 5.2\)
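Both summaries are one-liners once the PMF is stored as a dictionary. A quick check of the numbers above in Python:

```python
from fractions import Fraction

# The PMF from this example
pmf = {3: Fraction(1, 5), 4: Fraction(2, 5), 5: Fraction(1, 5), 10: Fraction(1, 5)}

mode = max(pmf, key=pmf.get)               # argmax_x p(x)
mean = sum(x * p for x, p in pmf.items())  # sum_x x p(x)
print(mode, float(mean))  # 4 5.2
```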

Point estimate: example
- Coin bucket/rocket: posterior distribution on \(p\) given that 3 heads were observed:
- Posterior mean: \(0 + (1/2) (1/9) + (1) (8/9) = 17/18\)
- MAP: \(1\)

Measures of uncertainty
- You are given a level, typically \(95\%\) or \(90\%\), denoted \(1-\alpha = 0.95\) or \(1-\alpha = 0.9\)
- Ideally, we would want to find a set \(T\) such that \(\sum_{x \in T} p(x) = 1-\alpha\)
- For discrete distributions, there may not be a solution, but let us ignore that issue for now (suppose each probability is very small and replace \(=\) by \(\approx\) up to an error bounded by the largest individual probability)
- For continuous distributions that will not be an issue
- There is still a problem: there are many solutions! Some of them might appear weird, e.g. include points of probability zero!
- Example: both sets shown in red contain \(75\%\) of the mass

High Density Intervals (HDI)
- High Density “Intervals”: pick a solution with the least number of points \(|T|\)
- To handle discrete models: \(\text{argmin}\{ |T| : \sum_{x \in T} p(x) \ge 1-\alpha\}\).
- Still not unique in some corner cases, but good enough
- “Interval” in name is unfortunate as it is not actually always an interval
- Example: multimodal distribution
- Sometimes called high density region
Example: the left one below is the HDI because the number of points, \(|T|\) is smaller (\(5 < 9\))
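The argmin above can be computed greedily: add points in order of decreasing probability until the accumulated mass reaches \(1-\alpha\). A Python sketch (`hdi` is an illustrative helper name; the PMF reuses the point-estimate example):

```python
from fractions import Fraction

def hdi(pmf, level):
    """Greedy HDI: add points by decreasing probability until mass >= level."""
    T, mass = set(), Fraction(0)
    for x, p in sorted(pmf.items(), key=lambda kv: kv[1], reverse=True):
        T.add(x)
        mass += p
        if mass >= level:
            return T, mass

pmf = {3: Fraction(1, 5), 4: Fraction(2, 5), 5: Fraction(1, 5), 10: Fraction(1, 5)}
T, mass = hdi(pmf, Fraction(3, 5))  # ties mean the minimizer is not unique
```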

Prediction

Today’s exercise: given you observe the same data (3 Heads in a row), what is your belief that the next draw is again a Heads?
\[\begin{align}
I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\
X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\}
\end{align}\]
Prediction: using posterior probability as the updated belief

Good approach:
- Add a random variable for the quantity you want to predict (next flip \(N\)).
- Add one level in the decision tree.
- Use Bayes rule.
- Twist: distinct paths lead to the same prediction
- Sum the probabilities of the paths you collapse (why can we do this?).
- Two ways to get \((H_{1:3}, N = \text{H again})\): \(\gamma(1) = 1/48 + 1/3\)
- Only one way to get \((H_{1:3}, N = \text{T})\): \(\gamma(0) = 1/48\)
- As before, normalize \(\gamma\) to get \(\pi\): \[{\mathbb{P}}(N = \text{H again}|H_{1:3}) = \frac{\gamma(1)}{\gamma(0) + \gamma(1)} = 17/18\]
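The path-collapsing computation above can be checked numerically. A Python sketch for the \(K+1 = 3\) coin bucket:

```python
from fractions import Fraction

K = 2
prior = Fraction(1, K + 1)
p = [Fraction(i, K) for i in range(K + 1)]   # heads probability of each coin

# Collapse paths: sum over the coin index for each value of N
gamma_H = sum(prior * p[i] ** 4 for i in range(K + 1))               # H1:3 then N = H
gamma_T = sum(prior * p[i] ** 3 * (1 - p[i]) for i in range(K + 1))  # H1:3 then N = T

p_next_heads = gamma_H / (gamma_H + gamma_T)
print(p_next_heads)  # 17/18
```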
Prediction example: all cases
Let \(\pi\) now denote a vector where the first component is \({\mathbb{P}}(N = \text{Tails}|H_{1:3})\) and the second, \({\mathbb{P}}(N = \text{Heads}|H_{1:3})\).
- If you see “Heads, Heads, Heads” updated beliefs: \(\pi = (1/18, 17/18)\)
- Interpretation: “I predict another Heads with 94% confidence”
- Exercise: if you see “Heads, Tails, Tails” or any other mix of the two: \(\pi = (1/2, 1/2)\)
- Interpretation: “I do not lean toward either prediction” (if forced, flip your own coin!)
- By symmetry: if you see “Tails, Tails, Tails”: \(\pi = (17/18, 1/18)\)
- Interpretation: “I predict another Tails with 94% confidence”
Law of total probability
The process of adding the probabilities of the paths is called the law of total probability, mathematically, if \(E_i\) is a partition of the sample space, then for any event \(A\), \[{\mathbb{P}}(A) = \sum_i {\mathbb{P}}(A, E_i)\]
Why it matters: this provides a principled way to handle many data analysis issues such as missing data, prediction, censoring, etc. More examples of that later.
Informally: “marginalize” over all the uncertain quantities. In particular, in the Bayesian framework, during prediction, the parameters can be better seen as “nuisance” parameters.
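As a small sanity check of the law of total probability on the coin bucket: marginalizing over the coin choice gives the probability of heads on a single flip.

```python
from fractions import Fraction

K = 2
prior = Fraction(1, K + 1)

# P(X = H) = sum_i P(C_i, X = H) = sum_i P(C_i) P(X = H | C_i)
p_heads = sum(prior * Fraction(i, K) for i in range(K + 1))
print(p_heads)  # 1/2
```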
Prediction: “bad approach”
In the prediction exercise, some of you may have used the following argument:
- Compute a point estimate on \(p\).
- Then look at the probability of H and T for a Bernoulli with the point estimate plugged-in
Warning: in general this is a suboptimal way to update beliefs.
But: in this specific example, if you take the point estimate to be the posterior mean, you will find the same answer. This is not true in the vast majority of cases! Here this works because the Bernoulli has a single parameter, its mean. Try it for other models next week, you will see the plug-in method gives a different, sub-optimal answer.
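The coincidence, and why it breaks for other point estimates, can be seen directly. A Python sketch using the posterior \((0, 1/9, 8/9)\) from the earlier slide:

```python
from fractions import Fraction

# Posterior over p after HHH, from the earlier slide: (0, 1/9, 8/9)
post = {Fraction(0): Fraction(0), Fraction(1, 2): Fraction(1, 9), Fraction(1): Fraction(8, 9)}

# Full Bayesian answer: marginalize over p (law of total probability)
predictive = sum(w * p for p, w in post.items())       # 17/18

# Plug-in with the posterior mean: Bern(E[p]) assigns P(H) = E[p],
# which is the same sum, hence the coincidence for the Bernoulli
post_mean = sum(p * w for p, w in post.items())        # 17/18

# Plug-in with the MAP instead: Bern(1) assigns P(H) = 1, a different answer
map_plug_in = max(post, key=post.get)
print(predictive, post_mean, map_plug_in)  # 17/18 17/18 1
```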
Calibrated prediction uncertainty
Question: in what sense is a belief update “optimal”?
Consider the following experiment based on repeated synthetic data generation followed by prediction evaluation
library(dplyr)    # for %>% and mutate()
library(ggplot2)  # for the plot below

set.seed(1)
rbern <- function(n, prob) { return(rbinom(n, 1, prob)) }
rdunif <- function(max) { return(ceiling(max * runif(1))) } # unif on 1, 2, ..., max

# We will keep track for each synthetic dataset of
# whether we successfully predicted (we know truth because we are using synthetic)
success <- c()
nIters <- 10000
for (iteration in 1:nIters) {
  # Data generation
  i <- rdunif(3) - 1 # Uniform on {0, 1, 2}
  x <- rbern(3, i/2.0)
  nextThrow <- rbern(1, i/2.0)
  nHeads <- sum(x)
  # Record prediction statistics for the case of interest
  if (nHeads == 3) {
    pred <- 1 # predict Heads (posterior predictive probability 17/18)
    success <- c(success, (if (pred == nextThrow) 1 else 0))
  }
}

# Plot evolution of the statistics across iterations
df <- data.frame("iteration" = c(1:length(success)), success)
df <- df %>%
  mutate(
    successRate = cumsum(success) / seq_along(success)
  )
ggplot(data=df, aes(x=iteration, y=successRate)) +
  geom_hline(yintercept=17/18, linetype="dashed", color = "black") +
  geom_line()

Calibration
- This property, “being right 94% of the time when you say you are 94% sure”, is called calibration
- For Bayesian methods, calibration is always satisfied if the model is true
- Things are more complicated if the model is not true (“M-open”), more on that later
New example: change point problem
Moved to second week