DSCI 553 - Statistical Inference II: Bayesian Analysis

Alexandre Bouchard-Côté

2/7/2019

Goal today

Illustrate different issues and techniques that arise in Bayesian analysis using our rocket/coin flip model.

Again, we will cover more advanced examples next week using PPLs.

Extra resources

Bayesian statistics: short historical overview

Recall: rocket problem

drawing

Recall: uncertainty estimates

Recall: Bayesian inference, high-level picture

  1. Construct a probability model including
    • random variables for what we will measure/observe
    • random variables for the unknown (latent) quantities
      • those we are interested in (“parameters”, “predictions”)
      • others that just help us formulate the problem (“nuisance”, “random effects”). Ignore this for now.
  2. Compute the posterior distribution conditionally on the actual data at hand
  3. Use the posterior distribution to:
    • make predictions (point estimates)
    • estimate uncertainty (credible intervals)
    • make a decision (more on this later)

Recall: intuition behind conditioning

drawing

Recall: definition of conditional probability \({\mathbb{P}}(A | E)\)

drawing

Recall: “constructing a probability model”

Better notation

Using the slick probabilist’s notation introduced at the end of last lecture, our model is:

\[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\} \end{align}\]

drawing

Recall:

\[\begin{align}{\mathbb{P}}(X_m = x | I = i) = \begin{cases} i/K, & \text{if } x = 1 \\ 1 - i/K, & \text{if } x = 0 \\ 0, & \text{otherwise} \end{cases} \end{align}\]

Recall: warm-up calculation

drawing

Recall: approach for warm-up calculation

\[ \pi(i) = \frac{\gamma(i)}{\sum_{i=0}^K \gamma(i)} \]
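For concreteness, here is a worked version of the warm-up (assuming \(K = 2\) and the running observation of three Heads, \(H_{1:3}\)):

\[\begin{align} \gamma(i) &= {\mathbb{P}}(I = i)\,{\mathbb{P}}(H_{1:3} | I = i) = \frac{1}{3}\left(\frac{i}{2}\right)^3 \\ \gamma(0) &= 0, \quad \gamma(1) = \frac{1}{24}, \quad \gamma(2) = \frac{1}{3} \end{align}\]

so that normalizing gives \(\pi(0) = 0\), \(\pi(1) = 1/9\), \(\pi(2) = 8/9\).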

Resulting belief updates

drawing

More on decision trees: drawing your own

drawing

Recall we computed each \(\gamma(i)\) by multiplying conditional probabilities along the edges of a path in a decision tree. I gave you the decision tree; how do you draw your own from scratch?

Idea:

  1. write down the model as we did before \[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\} \end{align}\]
  2. Each chunk of the form \(Y | Z\) tells you that you should make the decisions for random variable \(Y\) after those for random variable \(Z\).

Rationale: mathematically, chain rule works in any order (\({\mathbb{P}}(A, B) = {\mathbb{P}}(A){\mathbb{P}}(B|A) = {\mathbb{P}}(B){\mathbb{P}}(A|B)\)), but here we want to pick an order such that all the conditional distributions are known (part of the model specification)

Example: for the coin bucket model, this tells us we should draw \(I\) first, then the \(X_m\)’s (and that the order among the \(X\)’s does not matter).
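In equation form, this ordering corresponds to the chain-rule factorization in which every factor is a known piece of the model specification:

\[{\mathbb{P}}(I = i, X_1 = x_1, X_2 = x_2, X_3 = x_3) = {\mathbb{P}}(I = i) \prod_{m=1}^{3} {\mathbb{P}}(X_m = x_m | I = i)\]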

Large decision trees: graphical models

drawing

Question: how to do this for a large model? For example: \[\begin{align*} X_1 &\sim {\text{Unif}}(0, 1/2, 1) \\ X_2 | X_1 &\sim {\text{Bern}}(X_1) \\ X_3 | X_1, X_4 &\sim {\text{Unif}}\{1, X_1, X_4\} \\ X_4 &\sim {\text{Bern}}(1/3) \\ X_5 | X_3 &\sim {\text{Unif}}\{0, .., X_3\} \\ X_7 &\sim {\text{Bern}}(1/4) \\ X_6 | X_4, X_7 &\sim {\text{Unif}}\{X_4, X_7\} \\ X_8 &\sim {\text{Bern}}(1/4) \end{align*}\]

Idea:

  1. Build a graph, where nodes are random variables and edges are dependencies coming from the given conditional distributions. If you see \(X | Y\), add an edge \(Y \to X\).
  2. Use an ordering of the nodes that respects the directionality of all the edges.

Terminology: this graph is called a (directed) graphical model or Bayes net
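The two steps above can be sketched in code. A minimal sketch in Python: the edge set is read off the example model (`X | Y` gives an edge `Y -> X`), and the ordering is computed with Kahn’s algorithm; the function name `topological_order` and the choice of algorithm are mine, not part of the lecture.

```python
# Edges read off the model: each "X | Y, Z" adds edges Y -> X and Z -> X.
edges = {
    ("X1", "X2"),                # X2 | X1
    ("X1", "X3"), ("X4", "X3"),  # X3 | X1, X4
    ("X3", "X5"),                # X5 | X3
    ("X4", "X6"), ("X7", "X6"),  # X6 | X4, X7
}
nodes = {f"X{i}" for i in range(1, 9)}  # X8 has no edges but is still a node

def topological_order(nodes, edges):
    """Kahn's algorithm: repeatedly emit nodes with no remaining parents."""
    parents = {n: {a for (a, b) in edges if b == n} for n in nodes}
    order = []
    while parents:
        ready = sorted(n for n, ps in parents.items() if not ps)
        if not ready:
            raise ValueError("cycle: no valid sampling order exists")
        for n in ready:
            order.append(n)
            del parents[n]
        for n in parents:
            parents[n] -= set(ready)
    return order

order = topological_order(nodes, edges)
```

Any order returned this way respects the directionality of every edge, so every variable is sampled after its parents.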

Generating synthetic datasets
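A synthetic dataset is generated by sampling the model forward in an order that respects the graph (ancestral sampling): first the latent variable, then the observations given it. A minimal sketch in Python for the coin-bucket model; the function `simulate` and its parameter names are illustrative (the lecture’s own R version appears in the calibration experiment below).

```python
import random

def simulate(K=2, n_flips=3, rng=random):
    """Ancestral sampling for the coin-bucket model:
    draw I ~ Unif{0, ..., K}, then X_m | I ~ Bern(I/K) for each flip."""
    i = rng.randint(0, K)  # inclusive on both ends
    x = [1 if rng.random() < i / K else 0 for _ in range(n_flips)]
    return i, x
```

Because the truth \(I\) is recorded alongside the data, synthetic datasets let us check whether our inferences recover it.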

Recall: point estimate (summaries) for PMFs and posterior PMF

Point estimate: example

drawing

Point estimate: example

Measures of uncertainty

drawing

High Density Intervals (HDI)

Example: the set on the left below is the HDI because its number of points \(|T|\) is smaller (\(5 < 9\))

drawing
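One way to make this concrete for a PMF: greedily take the highest-probability support points until their total mass reaches the desired level, which yields the smallest such set. A sketch in Python; the function name `hdi_points`, the greedy construction, and the example PMF are mine, not from the slides.

```python
def hdi_points(pmf, level=0.95):
    """Smallest set of support points whose total mass reaches `level`.

    pmf: dict mapping support point -> probability (assumed to sum to 1).
    Greedily accumulates points in decreasing order of probability.
    """
    total = 0.0
    points = []
    for x, p in sorted(pmf.items(), key=lambda kv: kv[1], reverse=True):
        points.append(x)
        total += p
        if total >= level:
            break
    return set(points)

# Example: a peaked posterior PMF over {0, ..., 4}
posterior = {0: 0.05, 1: 0.2, 2: 0.5, 3: 0.2, 4: 0.05}
```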

Prediction

drawing

Today’s exercise: given that you observe the same data (3 Heads in a row), what is your belief that the next draw is again Heads?

\[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_m | I &\sim {\text{Bern}}(I/K); m \in \{1, 2, 3\} \end{align}\]

Prediction: using posterior probability as the updated belief

drawing

Good approach:

  1. Add a random variable for the quantity you want to predict (next flip \(N\)).
  2. Add one level in the decision tree.
  3. Use Bayes rule.
    • Twist: distinct paths lead to the same prediction
    • Sum the probabilities of the paths you collapse (why can we do this?).
      • Two ways to get \((H_{1:3}, N = \text{H again})\): \(\gamma(1) = 1/48 + 1/3\)
      • Only one way to get \((H_{1:3}, N = \text{T})\): \(\gamma(0) = 1/48\)
    • As before, normalize \(\gamma\) to get \(\pi\): \[{\mathbb{P}}(N = \text{H again}|H_{1:3}) = \frac{\gamma(1)}{\gamma(0) + \gamma(1)} = 17/18\]

Prediction example: all cases

Let \(\pi\) now denote a vector whose first component is \({\mathbb{P}}(N = \text{Tails}|H_{1:3})\) and whose second is \({\mathbb{P}}(N = \text{Heads}|H_{1:3})\).

Law of total probability

The process of adding the probabilities of the paths is an instance of the law of total probability. Mathematically, if the events \(E_i\) form a partition of the sample space, then for any event \(A\), \[{\mathbb{P}}(A) = \sum_i {\mathbb{P}}(A, E_i)\]

Why it matters: this provides a principled way to handle many data analysis issues such as missing data, prediction, censoring, etc. More examples of this later.

Informally: “marginalize” over all the uncertain quantities. In particular, in the Bayesian framework, during prediction the parameters are best seen as “nuisance” parameters to be marginalized over.
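Applied to the prediction exercise, marginalizing over \(I\) with the posterior \(\pi(1) = 1/9\), \(\pi(2) = 8/9\) (obtained by normalizing \(\gamma(i) = \frac{1}{3}(i/2)^3\)) recovers the decision-tree answer:

\[{\mathbb{P}}(N = \text{H} | H_{1:3}) = \sum_{i=0}^{2} {\mathbb{P}}(N = \text{H} | I = i)\,\pi(i) = \frac{1}{9}\cdot\frac{1}{2} + \frac{8}{9}\cdot 1 = \frac{17}{18}\]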

Prediction: “bad approach”

In the prediction exercise, some of you may have used the following argument:

  1. Compute a point estimate on \(p\).
  2. Then look at the probabilities of H and T for a Bernoulli with the point estimate plugged in.

Warning: in general this is a suboptimal way to update beliefs.

But: in this specific example, if you take the point estimate to be the posterior mean, you will find the same answer. This is not true in the vast majority of cases! Here it works because the Bernoulli has a single parameter, its mean. Try it for other models next week; you will see the plug-in method gives a different, sub-optimal answer.
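To see why the two approaches coincide here: the posterior mean of the heads probability \(p = I/2\) is

\[\hat p = {\mathbb{E}}[I/2 \mid H_{1:3}] = \frac{1}{9}\cdot\frac{1}{2} + \frac{8}{9}\cdot 1 = \frac{17}{18}\]

and for a Bernoulli the predictive probability of Heads is exactly its mean parameter, so plugging in \(\hat p\) reproduces \({\mathbb{P}}(N = \text{H}|H_{1:3}) = 17/18\). When the predictive probability is not linear in the parameter, this coincidence breaks down.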

Calibrated prediction uncertainty

Question: in what sense is a belief update “optimal”?

Consider the following experiment, based on repeated synthetic data generation followed by prediction evaluation:

library(dplyr)   # needed below for %>% and mutate
library(ggplot2) # needed below for plotting

set.seed(1)
rbern <- function(n, prob) { return(rbinom(n, 1, prob)) }
rdunif <- function(max) { return(ceiling(max*runif(1))) } # uniform on {1, 2, ..., max}

# We will keep track for each synthetic dataset of
# whether we successfully predicted (we know truth because we are using synthetic)
success <- c()    
nIters <- 10000
for (iteration in 1:nIters) {
  # Data generation
  i <- rdunif(3) - 1 # Uniform on {0, 1, 2}
  x <- rbern(3, i/2.0)
  nextThrow <- rbern(1, i/2.0)
  nHeads <- sum(x)
  # Record prediction statistics for the case of interest
  if (nHeads == 3) { 
    pred <- 1 # predict Heads, the most probable next flip given three Heads
    success <- c( success, (if (pred == nextThrow) 1 else 0))
  }
}

# Plot evolution of the statistics across iterations
df <- data.frame("iteration" = seq_along(success), success)
df <- df %>%
  mutate(
    successRate = cumsum(success) / seq_along(success)
  )
ggplot(data=df, aes(x=iteration, y=successRate)) +
  geom_hline(yintercept=17/18, linetype="dashed", color = "black") + 
  geom_line()

Calibration

New example: change point problem

Moved to second week