STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

3/4/2019

Goal today

Logistics

Recall: motivation

Typical statistics course: a bunch of recipes.

Bayesian statistics: a philosophy applicable to pretty much any data analysis problem

Recall: some imperfect but useful terminology

drawing

Recall: Accuracy of point estimate as a function of number of observations?

set.seed(1)
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max
K = 1000

# Later, we will use different prior to generate data versus do inference, 
# but for now, use the same (uniform on the K+1 grid points)
prior_used_for_computing_posterior <- rep(1/(K+1), K+1)

simulate_posterior_mean_error <- function(n_datapoints) {
  i <- rdunif(K + 1) - 1  # Always generate the data using a uniform prior
  true_p <- i/K
  x <- rbinom(1, n_datapoints, true_p) # Recall: a Binomial is just a sum of iid Bernoullis
  post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
  post_mean <- posterior_mean(post)
  return(abs(post_mean - true_p)) # Return the error by comparing posterior mean and p used to generate data
}

Recall: Accuracy of point estimate, empirical results

set.seed(1)
df <- data.frame("n_observations" = rep(c(10, 32, 100, 316, 1000), each=1000))
df$error_well_specified <- sapply(df$n_observations, simulate_posterior_mean_error)

ggplot(data=df, aes(x=n_observations, y=error_well_specified+1e-5)) +
  stat_summary(fun.y = mean, geom="line") + # Line averages over 1000 datasets
  scale_x_log10() +  # Show result in log-log scale
  scale_y_log10() +
  geom_point()

Recall: Accuracy of point estimate as a function of number of observations

\(\underbrace{\log_{10}(\text{error})}_{y} = a \underbrace{\log_{10}(\text{number of observation})}_{x} + b\)

we get \(\text{error} = O((\text{number of observations})^{-1/2})\)

Recall: What if we used the “wrong” prior

prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
plot(seq(0, 1, by=1/K), prior_used_for_computing_posterior)

set.seed(2)
df$error_mis_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
df <- gather(data=df, key=prior_type, value=error, error_mis_specified, error_well_specified)

ggplot(data=df, aes(x=n_observations, y=error+1e-5, colour=factor(prior_type))) +
  stat_summary(fun.y = mean, geom="line") +
  scale_x_log10() + 
  scale_y_log10() +
  geom_point()

Recall: First way to break consistency: violating Cromwell’s rule

drawing

Second way to break consistency: Unidentifiable models

model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}

drawing

Theoretical justification of consistency: Doob’s consistency theorem

Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.

Corollary: if you use the wrong prior, \(\tilde \pi_0\), but \(\tilde \pi_0(A) > 0 \Longleftrightarrow \pi_0(A) > 0\), then Doob’s consistency still hold for identifiable exchangeable models.

Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, you will get right anything that’s derived from it)

Exercise debrief: Bayes rule

drawing

\[ \pi(i) = \frac{\gamma(i)}{\sum_{i=0}^K \gamma(i)} \]

posterior_distribution <- function(prior_probs, n_successes, n_trials){         
  K <- length(prior_probs)-1 # K+1 values that your p can assume
  n_fails <- n_trials - n_successes
  p <- seq(0, 1, 1/K)
  posterior_probs <-                                       # 1. this computes gamma(i)
    prior_probs *                                          #    - pr. for picking coin (prior) (=1/3 in picture)
    p^n_successes * (1-p)^n_fails                          #    - conditional for the flips 
  posterior_probs <- posterior_probs/sum(posterior_probs)  # 2. normalize gamma(i)
  post_prob <- rbind(p, posterior_probs)
  return(post_prob)
}

Exercise debrief: point estimates

posterior_mean <- function(posterior_probs){
  return(sum(posterior_probs[1,] * posterior_probs[2,]))
}

Exercise debrief: construction of HDIs

drawing

high_density_intervals <- function(alpha, posterior_probs){
  ordered_probs = posterior_probs[,order(posterior_probs[2,], decreasing = TRUE)]
  cumulative_probs = cumsum(ordered_probs[2,])
  index = which.max(cumulative_probs >= (1-alpha))
  return(ordered_probs[,1:index, drop=FALSE])
}

Exercise debrief: calibration of credible intervals

set.seed(1)
K <- 1000
prior_used_for_computing_posterior <- rep(1/(K+1), K+1) # Use same prior for simulation and posterior computation
n_repeats <- 1000
alpha <- 0.1

hdi_coverage_pr <- function(n_datapoints) {
  n_inclusions <- 0
  for (repetition in seq(1:n_repeats)) {
    i <- rdunif(K + 1) - 1  # Always generate the data using a uniform prior
    true_p <- i/K
    x <- rbinom(1, n_datapoints, true_p)
    post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
    
    # This if is just a hacky way to check if true parameter is in the HDI credible interval
    if (sum(abs(true_p - high_density_intervals(alpha, post)[1,]) < 10e-10) == 1) {
      n_inclusions <- n_inclusions + 1
    }
  }
  return(n_inclusions/n_repeats) # Fraction of simulation where the true parameter was in interval
}

df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)

ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

Why well-specified credible intervals are calibrated for any fixed dataset sizes

drawing

Calibration of credible intervals (misspecified case)

# Using now the same non-uniform prior as before for posterior calculation
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)

set.seed(1)
K <- 1000
n_repeats <- 1000
alpha <- 0.1

df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)

ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

Asymptotic calibration under certain misspecification

drawing

Bernstein-von Mises: under certain conditions, even when the prior is misspecified the actual coverage of credible intervals converges to the nominal coverage. (converge just means “gets arbitrarily close”)

Conditions include, but are not limited to:

  1. \(\theta \in \Theta\) needs to live in a continuous rather than discrete space! I.e.: \(\Theta \subset {\mathbf{R}}^d\)
    • Intuition: Bernstein-von Mises actually proves something stronger: convergence of the rescaled, centered posterior to a normal distribution, which is a continuous distribution (can you guess the scaling?)
    • “Bayesian Central Limit Theorem”
  2. \(p(\theta)\) (the prior) is now a density instead of a PMF,
    • we assume it is non-zero in a neighborhood of the true parameter
  3. we assume the posterior mean is consistent
    • recall: that means the error goes to zero as number of data points increases
  4. we assume the model is parametric:
    • \(\theta\) is allowed to be a vector, but you are not allowed to make it bigger as the number of observations increases

We will explore these conditions in more detail.

Survey: who is familiar with asymptotic normality of the MLE?

Practical consequence:

Continuous random variables: motivation

Continuous random variables: first implementation

Recall: how to use a density to compute a probability?

drawing

\[{\mathbb{P}}(X \in [0.1, 0.3]) = \int_{0.1}^{0.3} f(x) {\text{d}}x\] drawing

Monte Carlo as a black box

Motivation:

However, for simplicity, let us start by assuming the model has only 1 real valued unknown parameter.

Idea:

\[{\mathbb{P}}(X \in [0.1, 0.3]) \approx \frac{\# \text{ points in } S \text{ that fall in }[0.1, 0.3]}{M}\]

Intuition. Let’s say you want to compute the area of this weird curved shape \(A\):

drawing

How to create Monte Carlo samples \(S\)

Guiding principle: we have two ways to compute probabilities: densities and Monte Carlo. We want that as the size of \(S\) increases, the discrepancy between the two methods should become arbitrary close to zero

Recap: computing a mean

drawing

\[{\mathbf{E}}[X] = \int x f(x) {\text{d}}x\]

\[{\mathbf{E}}[X] \approx \frac{\sum_{i=1}^M X_i}{M}\]

\[{\mathbf{E}}[X^2] \approx \frac{\sum_{i=1}^M X^2_i}{M}\]

\[{\mathbf{E}}[g(X)] \approx \frac{\sum_{i=1}^M g(X_i)}{M}\]

More asymptotic regimes

We have done two thought experiments so far:

drawing

Note:

drawing

De Finetti: intuition and motivation

drawing

drawing

Exchangeable random variables

De Finetti: formal statement

Theorem: if \((X_1, X_2, \dots)\) are exchangeable Bernoulli random variables, then there exists a \(\theta\), a prior on \(\theta\), and a likelihood such that the observations are iid given \(\theta\).