STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

2/25/2019

Organization

This week: a Bayesian “bootcamp”. People taking this course typically have a wide range of backgrounds. This bootcamp will make sure we all share a common foundation.

Rest of the course: selected topics from…

Logistics

Bayesian analysis: pros and cons

Bayesian analysis: some real examples


This week’s example


Paradox?

Uncertainty estimates

Uncertainty will not go away


Bayesian inference: high-level picture

  1. Construct a probability model including
    • random variables for what we will measure/observe
    • random variables for the unknown quantities
      • those we are interested in (“parameters”, “predictions”)
      • others that just help us formulate the problem (“nuisance”, “random effects”).
  2. Compute the posterior distribution conditionally on the actual data at hand
  3. Use the posterior distribution to:
    • make predictions (point estimates)
    • estimate uncertainty (credible intervals)
    • make a decision (more on this later)

Step 3 is unified into one framework: decision theory.

First part: “constructing a probability model”

Scientific models

A simplification of reality amenable to mathematical investigation.

\[\text{Reality} \xrightarrow{\text{Art + Scientific method}} \text{Model} \xrightarrow{\text{Mathematics}} \text{Prediction}\]

Towards a model: biased coins


Making \(p\) random


Simple probability model for the rocket launch problem


Warm-up calculation / probability diagnostic


Diagnostic (part 2): given that you observe the same data (3 Heads in a row), what is your belief that the next draw is again Heads?

Exercise 1, question 1: write a function taking as input:

and returning as output an array of posterior probabilities.

Recap: some probability vocabulary


Recap: axioms/constraints of probability


Recap: conditioning


Recap: definition of conditional probability \({\mathbb{P}}(A | E)\)


We can now formalize the question

Labelling decision trees with (conditional) probabilities


Last step in the computation

  1. Attack the slightly more general problem \(\pi(i) = {\mathbb{P}}(C_i | H_1, H_2, H_3)\)
  2. By definition of conditioning \[\pi(i) = \frac{{\mathbb{P}}(C_i, H_1, H_2, H_3)}{{\mathbb{P}}(H_1, H_2, H_3)}\] let us call the numerator \(\gamma(i)\) and the denominator, \(Z\), i.e. \(\pi(i) = \frac{\gamma(i)}{Z}\)
  3. Start by computing \(\gamma(0), \gamma(1), \gamma(2)\) (using chain rule, see previous slide)
  4. Note \(Z = \gamma(0) + \gamma(1) + \gamma(2)\) (why?).
    • Hence we can compute \(\pi\) by normalizing the vector \(\gamma\), i.e. rescaling it so that it sums to one.

Mathematical expression for the above algorithm (Bayes’ rule): if \(i \in \{0, 1, \dots, K\}\) indexes a partition \(C_i\) of \(S\), then the posterior distribution, \(\pi(i) = {\mathbb{P}}(C_i | D)\), is proportional to the joint distribution \(\gamma(i) = {\mathbb{P}}(C_i, D)\) in the sense that

\[ \pi(i) = \frac{\gamma(i)}{\sum_{j=0}^K \gamma(j)} \]

Answer: \({\mathbb{P}}(C_1 | H_1, H_2, H_3) = 1/9\).
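The normalization above can be checked numerically. A minimal R sketch for \(K = 2\) and three observed Heads (all quantities taken from the slides):

```r
# Coin bucket model: I ~ Unif{0, 1, 2}, p = I/K with K = 2.
# gamma(i) = P(C_i, H1, H2, H3) = P(I = i) * (i/K)^3  (chain rule)
K <- 2
prior <- rep(1 / (K + 1), K + 1)  # uniform prior on i = 0, 1, 2
likelihood <- ((0:K) / K)^3       # probability of 3 Heads given p = i/K
gam <- prior * likelihood         # joint probabilities gamma(0), ..., gamma(K)
Z <- sum(gam)                     # normalization constant
post <- gam / Z                   # posterior distribution pi
post                              # 0, 1/9, 8/9
```

As expected, the posterior puts probability \(1/9\) on the fair coin and \(8/9\) on the two-headed coin.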

From posterior distributions to point estimates

Recap: random variables


Probabilist’s notation


Example: coins


Example: discrete uniform distributions


Better notation

Using the slick probabilist’s notation introduced at the end of the last lecture, our model is:

\[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\} \end{align}\]


Recall:

\[\begin{align}{\mathbb{P}}(X_m = x | I = i) = \begin{cases} i/K, & \text{if } x = 1 \\ 1 - i/K, & \text{if } x = 0 \\ 0, & \text{otherwise} \end{cases} \end{align}\]
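This conditional PMF can be written as a small R function (the name `bernPmf` is just an illustrative choice, not part of the model):

```r
# Conditional PMF P(X_m = x | I = i) for the coin bucket model, K = 2 by default
bernPmf <- function(x, i, K = 2) {
  if (x == 1) i / K else if (x == 0) 1 - i / K else 0
}
bernPmf(1, 1)  # 0.5: a fair coin lands Heads half the time
bernPmf(0, 2)  # 0:   the two-headed coin never lands Tails
```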

Point estimate (summaries) for PMFs and posterior PMF

More on decision trees: drawing your own


Recall that we computed each \(\gamma(i)\) by multiplying the conditional probabilities on the edges of a path in a decision tree. I gave you the decision tree; how do you draw your own from scratch?

Idea:

  1. write down the model as we did before \[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\} \end{align}\]
  2. Each chunk of the form \(Y | Z\) tells you that you should make the decisions for random variable \(Y\) after those for random variable \(Z\).

Rationale: mathematically, the chain rule works in any order (\({\mathbb{P}}(A, B) = {\mathbb{P}}(A){\mathbb{P}}(B|A) = {\mathbb{P}}(B){\mathbb{P}}(A|B)\)), but here we want to pick an order such that all the conditional distributions are known (i.e. part of the model specification).

Example: for the coin bucket model, this tells us we should draw \(I\) first, then the \(X_m\)’s (and that the order among the \(X\)’s does not matter).

Large decision trees: graphical models


Question: how to do this for a large model? For example: \[\begin{align*} X_1 &\sim {\text{Unif}}(0, 1/2, 1) \\ X_2 | X_1 &\sim {\text{Bern}}(X_1) \\ X_3 | X_1, X_4 &\sim {\text{Unif}}\{1, X_1, X_4\} \\ X_4 &\sim {\text{Bern}}(1/3) \\ X_5 | X_3 &\sim {\text{Unif}}\{0, .., X_3\} \\ X_7 &\sim {\text{Bern}}(1/4) \\ X_6 | X_4, X_7 &\sim {\text{Unif}}\{X_4, X_7\} \\ X_8 &\sim {\text{Bern}}(1/4) \end{align*}\]

Idea:

  1. Build a graph, where nodes are random variables and edges are dependencies coming from the given conditional distributions. If you see \(X | Y\), add an edge \(Y \to X\).
  2. Use an ordering of the nodes that respects the directionality of all the edges.

Terminology: this graph is called a (directed) graphical model or Bayes net
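The two steps above can be sketched in R: build the edge list for the 8-variable model, then find a valid sampling order with Kahn’s algorithm (one standard way to topologically sort a directed acyclic graph):

```r
# Directed edges parent -> child, read off the conditional distributions above:
# X2|X1, X3|X1,X4, X5|X3, X6|X4,X7 (X4, X7, X8 have no parents)
edges <- rbind(
  c(1, 2), c(1, 3), c(4, 3), c(3, 5), c(4, 6), c(7, 6))
nNodes <- 8
inDegree <- tabulate(edges[, 2], nbins = nNodes)
ord <- c()
ready <- which(inDegree == 0)  # nodes whose parents are all processed
while (length(ready) > 0) {
  v <- ready[1]; ready <- ready[-1]
  ord <- c(ord, v)
  children <- edges[edges[, 1] == v, 2]
  for (w in children) {
    inDegree[w] <- inDegree[w] - 1
    if (inDegree[w] == 0) ready <- c(ready, w)
  }
}
ord  # a valid sampling order, e.g. 1 4 7 8 2 3 6 5
```

Any order produced this way respects all the arrows, so each variable can be sampled from its known conditional distribution when its turn comes.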

Generating synthetic datasets
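Once a sampling order is fixed, generating a synthetic dataset is just ancestral sampling: draw each variable given its parents, in order. For the coin bucket model this means drawing \(I\) first, then the flips:

```r
set.seed(1)
# Ancestral sampling from the coin bucket model (K = 2):
#   I ~ Unif{0, 1, 2},  X_m | I ~ Bern(I/K) for m = 1, 2, 3
K <- 2
i <- sample(0:K, 1)       # draw the coin index from its prior
x <- rbinom(3, 1, i / K)  # then the three flips given I = i
list(i = i, flips = x)
```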

Measures of uncertainty


High Density Intervals (HDI)

Example: the left one below is the HDI because the number of points, \(|T|\), is smaller (\(5 < 9\))


Exercise 1, question 2: implement a function taking as input a posterior PMF (as a vector) and a level, and returning an HDI.

Prediction


Diagnostic (part 2): given that you observe the same data (3 Heads in a row), what is your belief that the next draw is again Heads?

\[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\} \end{align}\]

Prediction: using posterior probability as the updated belief


Good approach:

  1. Add a random variable for the quantity you want to predict (next flip \(N\)).
  2. Add one level in the decision tree.
  3. Use Bayes rule.
    • Twist: distinct paths lead to the same prediction
    • Sum the probabilities of the paths you collapse (why can we do this?).
      • Two ways to get \((H_{1:3}, N = \text{H again})\): \(\gamma(1) = 1/48 + 1/3\)
      • Only one way to get \((H_{1:3}, N = \text{T})\): \(\gamma(0) = 1/48\)
    • As before, normalize \(\gamma\) to get \(\pi\): \[{\mathbb{P}}(N = \text{H again}|H_{1:3}) = \frac{\gamma(1)}{\gamma(0) + \gamma(1)} = 17/18\]
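The collapsed-path probabilities above can be reproduced numerically (values as in the slides, \(K = 2\)):

```r
# gamma(H) = sum_i P(I = i) * (i/K)^4          (three observed H, plus N = H)
# gamma(T) = sum_i P(I = i) * (i/K)^3 * (1 - i/K)
p <- (0:2) / 2                       # possible values of the coin bias
prior <- rep(1/3, 3)
gammaH <- sum(prior * p^4)           # = 1/48 + 1/3
gammaT <- sum(prior * p^3 * (1 - p)) # = 1/48
gammaH / (gammaH + gammaT)           # = 17/18
```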

Prediction example: all cases

Let now \(\pi\) denote a vector where the first component is \({\mathbb{P}}(N = \text{Tails}|H_{1:3})\) and the second, \({\mathbb{P}}(N = \text{Heads}|H_{1:3})\).

Recap: law of total probability

The process of adding the probabilities of the paths is an instance of the law of total probability. Mathematically, if \(\{E_i\}\) is a partition of the sample space, then for any event \(A\), \[{\mathbb{P}}(A) = \sum_i {\mathbb{P}}(A, E_i)\]

Why it matters: this provides a principled way to handle many data analysis issues such as missing data, prediction, censoring, etc. More examples of this later.

Informally: “marginalize” over all the uncertain quantities. In particular, in the Bayesian framework, the parameters are best seen as “nuisance” variables during prediction.
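A quick numeric check of the law of total probability on the coin bucket model: the marginal probability of three Heads is the sum of the joint probabilities over the partition \(\{I = i\}\).

```r
# P(H1, H2, H3) = sum_i P(I = i, H1, H2, H3), with P(I = i) = 1/3 and p = i/2
p <- (0:2) / 2
jointByPath <- (1/3) * p^3  # joint probability of each path in the tree
sum(jointByPath)            # marginal P(H1, H2, H3) = 3/8
```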

Prediction: “bad approach”

In the prediction exercise, some of you may have used the following argument:

  1. Compute a point estimate on \(p\).
  2. Then look at the probability of H and T for a Bernoulli with the point estimate plugged in

Warning: in general this is a suboptimal way to update beliefs.

But: in this specific example, if you take the point estimate to be the posterior mean, you will find the same answer. This coincidence does not hold in the vast majority of cases! It works here because the Bernoulli has a single parameter, its mean. Try it for other models next week and you will see the plug-in method give a different, sub-optimal answer.
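We can see both sides of this numerically. For a Bernoulli, the full predictive \({\mathbb{P}}(N = \text{H} | \text{data}) = \sum_i \pi(i)\, p_i\) is exactly the posterior mean of \(p\), so the mean plug-in coincides; plugging in a different point estimate, such as the posterior mode, does not:

```r
# Posterior over the coin bias after observing 3 Heads (K = 2)
p <- (0:2) / 2
post <- (1/3) * p^3
post <- post / sum(post)          # posterior: (0, 1/9, 8/9)
bayes <- sum(post * p)            # full predictive = posterior mean = 17/18
pluginMode <- p[which.max(post)]  # plug in the posterior mode instead: p = 1
c(bayes, pluginMode)              # 17/18 vs 1: mode plug-in overstates certainty
```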

Calibrated prediction uncertainty

Question: In what sense is a belief update “optimal”?

Consider the following experiment, based on repeated synthetic data generation followed by prediction evaluation:

set.seed(1)
library(dplyr)   # for %>% and mutate
library(ggplot2) # for the plot below

rbern <- function(n, prob) { return(rbinom(n, 1, prob)) }
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif on 1, 2, .., max

# For each synthetic dataset, keep track of whether we predicted the next
# throw successfully (we know the truth because the data is synthetic)
success <- c()
nIters <- 10000
for (iteration in 1:nIters) {
  # Data generation
  i <- rdunif(3) - 1 # Uniform on {0, 1, 2}
  x <- rbern(3, i/2.0)
  nextThrow <- rbern(1, i/2.0)
  nHeads <- sum(x)
  # Record prediction statistics for the case of interest
  if (nHeads == 3) { 
    pred <- 1 # predict Heads, as in the computation above
    success <- c( success, (if (pred == nextThrow) 1 else 0))
  }
}

# Plot evolution of the success rate across iterations
df <- data.frame("iteration" = 1:length(success), success)
df <- df %>%
  mutate(
    successRate = cumsum(success) / seq_along(success)
  )
ggplot(data=df, aes(x=iteration, y=successRate)) +
  geom_hline(yintercept=17/18, linetype="dashed", color = "black") + 
  geom_line()

Calibration

Exercise 1, question 3:

Exercise 1, optional question: apply your posterior inference method to real data. For example, the rocket launch data.

Exercise 1, optional question: a randomized algorithm can be viewed as a function that takes as input a special object which provides access to randomness. Let us assume this object only provides one function, called “bernoulli(p)”, to access iid Bernoullis. Write a function that takes an arbitrary randomized algorithm as input and prints out the decision tree.