STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

2/27/2019

Goal today

Logistics

Bayesian statistics: short historical overview

Recap: toy model

drawing

Recap: decision tree, random variable and PMF all in one picture

drawing

Posterior PMF

drawing

Recap: probabilist’s notation

drawing

Mathematically: \((I < 2) = \{s : I(s) < 2\}\)

Recap: declaration of conditionals

\[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_n | I &\sim {\text{Bern}}(I/K); n \in \{1, 2, 3\} \end{align}\]

drawing

Recall:

\[\begin{align}{\mathbb{P}}(X_n = x | I = i) = \begin{cases} i/K, & \text{if } x = 1 \\ 1 - i/K, & \text{if } x = 0 \\ 0, & \text{otherwise} \end{cases} \end{align}\]
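To make the posterior computation concrete, here is a minimal sketch (in Python rather than R, using exact fractions; it assumes the running example K = 2 and three observed heads) that normalizes prior times likelihood into the posterior PMF:

```python
from fractions import Fraction

K = 2
prior = [Fraction(1, K + 1)] * (K + 1)                      # I ~ Unif{0, 1, ..., K}
likelihood = [Fraction(i, K) ** 3 for i in range(K + 1)]    # P(three heads | I = i)

gamma = [p * l for p, l in zip(prior, likelihood)]          # unnormalized posterior
Z = sum(gamma)                                              # normalization constant
posterior = [g / Z for g in gamma]                          # pi(i) = gamma(i) / Z

print(posterior)  # [0, 1/9, 8/9]
```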

\[\begin{align} p &\sim {\text{Unif}}(0, 1) \\ X_n | p &\sim {\text{Bern}}(p); n \in \{1, 2, 3\} \end{align}\]

More on decision trees: drawing your own

drawing

Recall we computed each \(\gamma(i)\) by multiplying conditional probabilities along the edges of a path in a decision tree. I gave you the decision tree; how do you draw your own decision tree from scratch?

Idea:

  1. write down the model as we did before \[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_n | I &\sim {\text{Bern}}(I/K); n \in \{1, 2, 3\} \end{align}\]
  2. Each chunk of the form \(Y | Z\) tells you that you should make the decisions for random variable \(Y\) after those for random variable \(Z\).

Rationale: mathematically, the chain rule works in any order (\({\mathbb{P}}(A, B) = {\mathbb{P}}(A){\mathbb{P}}(B|A) = {\mathbb{P}}(B){\mathbb{P}}(A|B)\)), but here we want to pick an order such that all the conditional distributions are known (i.e., are part of the model specification).

Example: for the coin bucket model, this tells us we should draw \(I\) first, then the \(X_n\)’s (and that the order among the \(X\)’s does not matter).

Large decision trees: graphical models

drawing

Question: how to do this for a large model? For example: \[\begin{align*} X_1 &\sim {\text{Unif}}\{0, 1/2, 1\} \\ X_2 | X_1 &\sim {\text{Bern}}(X_1) \\ X_3 | X_1, X_4 &\sim {\text{Unif}}\{1, X_1, X_4\} \\ X_4 &\sim {\text{Bern}}(1/3) \\ X_5 | X_3 &\sim {\text{Unif}}\{0, .., X_3\} \\ X_7 &\sim {\text{Bern}}(1/4) \\ X_6 | X_4, X_7 &\sim {\text{Unif}}\{X_4, X_7\} \\ X_8 &\sim {\text{Bern}}(1/4) \end{align*}\]

Idea:

  1. Build a graph, where nodes are random variables and edges are dependencies coming from the given conditional distributions. If you see \(X | Y\), add an edge \(Y \to X\).
  2. Use an ordering of the nodes that respects the directionality of all the edges.

Terminology: this graph is called a (directed) graphical model or Bayes net
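The two steps can be sketched as follows (a minimal Python sketch; node \(k\) stands for \(X_k\), and the edge list is read directly off the declarations above, e.g. \(X_2 | X_1\) contributes the edge \(1 \to 2\)):

```python
from collections import defaultdict

# Edges parent -> child read off the declarations: "X | Y, Z" adds Y -> X and Z -> X
edges = [(1, 2), (1, 3), (4, 3), (3, 5), (4, 6), (7, 6)]
nodes = set(range(1, 9))  # X_1, ..., X_8 (X_8 has no edges)

children = defaultdict(list)
in_degree = {n: 0 for n in nodes}
for parent, child in edges:
    children[parent].append(child)
    in_degree[child] += 1

# Kahn's algorithm: repeatedly pick a node all of whose parents are already placed
order = []
ready = sorted(n for n in nodes if in_degree[n] == 0)
while ready:
    n = ready.pop(0)
    order.append(n)
    for c in children[n]:
        in_degree[c] -= 1
        if in_degree[c] == 0:
            ready.append(c)

print(order)  # a valid sampling order: every parent appears before its children
```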

Generating synthetic datasets

Some imperfect but useful terminology

drawing

Prediction

drawing

Question: given you observe the same data (3 Heads in a row), what is your belief that the next draw is again a Heads?

\[\begin{align} I &\sim {\text{Unif}}\{0, 1, 2, \dots, (K-1), K\} \\ X_n | I &\sim {\text{Bern}}(I/K); n \in \{1, 2, 3\} \end{align}\]

Prediction: the right approach

drawing

  1. Add a random variable for the quantity you want to predict (next flip \(N\)).
  2. Add one level in the decision tree.
  3. Use Bayes rule.
    • Twist: distinct paths lead to the same prediction
    • Sum the probabilities of the paths you collapse (why can we do this?).
      • Two ways to get \((H_{1:3}, N = \text{H again})\): \(\gamma(1) = 1/48 + 1/3\)
      • Only one way to get \((H_{1:3}, N = \text{T})\): \(\gamma(0) = 1/48\)
    • As before, normalize \(\gamma\) to get \(\pi\): \[{\mathbb{P}}(N = \text{H again}|H_{1:3}) = \frac{\gamma(1)}{\gamma(0) + \gamma(1)} = 17/18\]
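The collapsed-path computation above can be checked numerically; a short Python sketch with exact fractions (assuming K = 2, as in the running example):

```python
from fractions import Fraction

K = 2
gamma_H = Fraction(0)  # joint probability of (H_{1:3}, N = H again)
gamma_T = Fraction(0)  # joint probability of (H_{1:3}, N = T)
for i in range(K + 1):
    prior = Fraction(1, K + 1)
    p = Fraction(i, K)
    three_heads = p ** 3
    gamma_H += prior * three_heads * p        # paths ending with another H
    gamma_T += prior * three_heads * (1 - p)  # paths ending with a T

pred = gamma_H / (gamma_H + gamma_T)  # normalize gamma to get pi
print(pred)  # 17/18
```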

Prediction

Now let \(\pi\) denote a vector whose first component is \({\mathbb{P}}(N = \text{Tails}|H_{1:3})\) and whose second is \({\mathbb{P}}(N = \text{Heads}|H_{1:3})\).

Prediction: “bad approach”

In the prediction exercise, some of you may have used the following argument:

  1. Compute a point estimate on \(p\).
  2. Then look at the probability of H and T for a Bernoulli with the point estimate plugged-in

Warning: in general this could lead to a suboptimal way to update beliefs.

But: in this specific example, if you take the point estimate to be the posterior mean, you will find the same answer. This is not true in the vast majority of cases! Here it works because the Bernoulli has a single parameter, its mean. Try it for other models next week; you will see that the plug-in method gives a different, sub-optimal answer.
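To see concretely why the plug-in argument happens to work here but fails in general, a small Python check (using the coin bucket model with K = 2; the two-heads-in-a-row prediction is an illustrative extension, not part of the exercise):

```python
from fractions import Fraction

K = 2
p_values = [Fraction(i, K) for i in range(K + 1)]
gamma = [Fraction(1, K + 1) * p ** 3 for p in p_values]  # prior x likelihood of HHH
Z = sum(gamma)
posterior = [g / Z for g in gamma]

# Predicting ONE more head: the full predictive E[p | data] and the plug-in
# Bern(posterior mean) coincide, because P(H | p) = p is linear in p.
predictive_one = sum(pi * p for pi, p in zip(posterior, p_values))
print(predictive_one)  # 17/18; the plug-in gives the same number

# Predicting TWO more heads in a row: they differ, since E[p^2] != (E[p])^2.
predictive_two = sum(pi * p ** 2 for pi, p in zip(posterior, p_values))
plug_in_two = predictive_one ** 2
print(predictive_two, plug_in_two)  # 11/12 vs 289/324
```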

Calibrated prediction uncertainty

Question: In what sense is a belief update “optimal”?

Consider the following experiment, based on repeated synthetic data generation followed by prediction evaluation:

library(dplyr)   # needed below for %>% and mutate()
library(ggplot2) # needed below for ggplot()

set.seed(1)
rbern <- function(n, prob) { return(rbinom(n, 1, prob)) }
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif on {1, 2, .., max}

# We will keep track for each synthetic dataset of
# whether we successfully predicted (we know truth because we are using synthetic)
success <- c()    
nIters <- 10000
for (iteration in 1:nIters) {
  # Data generation
  i <- rdunif(3) - 1 # Uniform on {0, 1, 2}
  x <- rbern(3, i/2.0)
  nextThrow <- rbern(1, i/2.0)
  nHeads <- sum(x)
  # Record prediction statistics for the case of interest
  if (nHeads == 3) { 
    pred <- 1 
    success <- c( success, (if (pred == nextThrow) 1 else 0))
  }
}

# Plot evolution of the statistics across iterations
df <- data.frame("iteration" = c(1:length(success)), success)
df <- df %>%
  mutate(
    successRate = cumsum(success) / seq_along(success)
  )
ggplot(data=df, aes(x=iteration, y=successRate)) +
  geom_hline(yintercept=17/18, linetype="dashed", color = "black") + 
  geom_line()

Calibration

Exercise 1, question 3:

Exercise 1, optional question: apply your posterior inference method to real data. For example, the rocket launch data.

Exercise 1, optional question: a randomized algorithm can be viewed as a function that takes as input a special object which provides access to randomness. Let us assume this object only provides one function, called “bernoulli(p)”, to access iid Bernoullis. Write a function that takes an arbitrary randomized algorithm as input and prints out the decision tree.

Accuracy of point estimate as a function of number of observations?

set.seed(1)
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max
K <- 1000

# Later, we will use different prior to generate data versus do inference, 
# but for now, use the same (uniform on the K+1 grid points)
prior_used_for_computing_posterior <- rep(1/(K+1), K+1)

simulate_posterior_mean_error <- function(n_datapoints) {
  i <- rdunif(K + 1) - 1  # Always generate the data using a uniform prior
  true_p <- i/K
  x <- rbinom(1, n_datapoints, true_p) # Recall: a Binomial is just a sum of iid Bernoullis
  # posterior_distribution and posterior_mean: assumed implemented earlier (Exercise 1)
  post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
  post_mean <- posterior_mean(post)
  return(abs(post_mean - true_p)) # Return the error by comparing posterior mean and p used to generate data
}
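`posterior_distribution` and `posterior_mean` are assumed to be implemented already (e.g., as part of Exercise 1); for reference, here is one way they might look, sketched in Python (grid prior times binomial likelihood, then normalize):

```python
from math import comb

def posterior_distribution(prior, n_successes, n_trials):
    # prior is a PMF on the grid p = 0/K, 1/K, ..., K/K (K + 1 points)
    K = len(prior) - 1
    gamma = []
    for i, prior_i in enumerate(prior):
        p = i / K
        lik = comb(n_trials, n_successes) * p ** n_successes * (1 - p) ** (n_trials - n_successes)
        gamma.append(prior_i * lik)  # prior x likelihood
    Z = sum(gamma)                   # normalization constant
    return [g / Z for g in gamma]

def posterior_mean(post):
    K = len(post) - 1
    return sum(post[i] * i / K for i in range(K + 1))

# Example: uniform prior on {0, 1/2, 1}, observe 3 heads in 3 flips
post = posterior_distribution([1/3, 1/3, 1/3], 3, 3)
print(post, posterior_mean(post))  # posterior roughly [0, 1/9, 8/9], mean 17/18
```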

Accuracy of point estimate: empirical results

set.seed(1)
df <- data.frame("n_observations" = rep(c(10, 32, 100, 316, 1000), each=1000))
df$error_well_specified <- sapply(df$n_observations, simulate_posterior_mean_error)

ggplot(data=df, aes(x=n_observations, y=error_well_specified+1e-5)) +
  stat_summary(fun.y = mean, geom="line") + # Line averages over 1000 datasets
  scale_x_log10() +  # Show result in log-log scale
  scale_y_log10() +
  geom_point()

Accuracy of point estimate as a function of number of observations

Fitting a line in the log-log plot,

\(\underbrace{\log_{10}(\text{error})}_{y} = a \underbrace{\log_{10}(\text{number of observations})}_{x} + b,\)

with slope \(a \approx -1/2\), we get \(\text{error} = O((\text{number of observations})^{-1/2})\)

What if we used the “wrong” prior

prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
plot(seq(0, 1, by=1/K), prior_used_for_computing_posterior)

library(tidyr) # needed for gather()

set.seed(2)
df$error_mis_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
df <- gather(data=df, key=prior_type, value=error, error_mis_specified, error_well_specified)

ggplot(data=df, aes(x=n_observations, y=error+1e-5, colour=factor(prior_type))) +
  stat_summary(fun.y = mean, geom="line") +
  scale_x_log10() + 
  scale_y_log10() +
  geom_point()

First way to break consistency: violating Cromwell’s rule

drawing

Lecture ended here

Second way to break consistency: Unidentifiable models

model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}
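Unidentifiability here means that distinct parameter settings induce exactly the same distribution over the data, so no amount of data can distinguish them. A quick numerical check (a Python sketch; the particular values of \(p_1, p_2\) are arbitrary illustrations sharing the same product):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n_launches = 10
# Three different (p1, p2) settings, all with p1 * p2 = 0.2
settings = [(0.5, 0.4), (0.4, 0.5), (0.8, 0.25)]
for p1, p2 in settings:
    # The likelihood of nFails depends only on the product p1 * p2,
    # so every setting produces the same value: the data cannot tell them apart
    print(binom_pmf(3, n_launches, p1 * p2))
```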

drawing

Theoretical justification of consistency: Doob’s consistency theorem

Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.

Corollary: if you use the wrong prior, \(\tilde \pi_0\), but \(\tilde \pi_0(A) > 0 \Longleftrightarrow \pi_0(A) > 0\), then Doob’s consistency still holds for identifiable exchangeable models.

Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, you will get right anything that’s derived from it)

Continuous random variables: motivation

Continuous random variables: first implementation

Recall: how to use a density to compute a probability?

drawing

\[{\mathbb{P}}(X \in [0.1, 0.3]) = \int_{0.1}^{0.3} f(x) {\text{d}}x\]

Monte Carlo as a black box

Motivation:

Idea:

\[{\mathbb{P}}(X \in [0.1, 0.3]) \approx \frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M}\]

Intuition. Let’s say you want to compute the area of this weird curved shape \(A\):

drawing

How to create Monte Carlo samples \(S\)

Guiding principle: we have two ways to compute probabilities: densities and Monte Carlo. We want that, as the size of \(S\) increases, the discrepancy between the two methods becomes arbitrarily close to zero.
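To illustrate the guiding principle in the simplest case, a Python sketch with \(X \sim \text{Unif}(0, 1)\), where the density integral over \([0.1, 0.3]\) is exactly \(0.2\):

```python
import random

random.seed(1)
M = 100_000
S = [random.random() for _ in range(M)]  # Monte Carlo samples from Unif(0, 1)

# Fraction of samples falling in the interval approximates the probability
estimate = sum(1 for x in S if 0.1 <= x <= 0.3) / M
print(estimate)  # close to the exact probability 0.2
```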

Recap: computing a mean

drawing

\[{\mathbf{E}}[X] = \int x f(x) {\text{d}}x\]

\[{\mathbf{E}}[X] \approx \frac{\sum_{i=1}^M X_i}{M}\]

\[{\mathbf{E}}[X^2] \approx \frac{\sum_{i=1}^M X^2_i}{M}\]

\[{\mathbf{E}}[g(X)] \approx \frac{\sum_{i=1}^M g(X_i)}{M}\]
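The same set of samples approximates all of these expectations at once; a Python sketch with \(X \sim \text{Unif}(0, 1)\), where \({\mathbf{E}}[X] = 1/2\) and \({\mathbf{E}}[X^2] = 1/3\):

```python
import random

random.seed(1)
M = 100_000
samples = [random.random() for _ in range(M)]  # X_1, ..., X_M iid Unif(0, 1)

mean = sum(samples) / M                          # approximates E[X] = 1/2
second_moment = sum(x * x for x in samples) / M  # approximates E[X^2] = 1/3
print(mean, second_moment)
```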

More asymptotic regimes

We have done two thought experiments so far:

drawing

Note:

drawing