Consistency, misspecification, identifiability

Alexandre Bouchard-Côté

Accuracy of point estimates as a function of number of observations

Poll: how many additional observations should you collect?

  1. 10 times more data
  2. 100 times more data
  3. 1000 times more data
  4. there is no way to answer this question

Back of the envelope calculation

\[\text{SD}(\delta) = \sqrt{{\text{Var}}\frac{1}{n} \sum_{i=1}^n X_i} = \sqrt{\frac{n {\text{Var}}X_1}{n^2}} = \frac{\text{constant}}{\sqrt{n}}\]

Example

drawing

Numerical exploration

posterior_distribution <- function(prior_probs, n_successes, n_trials){         
  K <- length(prior_probs)-1 # K+1 values that your p can assume
  n_fails <- n_trials - n_successes
  p <- seq(0, 1, 1/K)
  posterior_probs <-                                       # 1. this computes gamma(i)
    prior_probs *                                          #    - prior
    p^n_successes * (1-p)^n_fails                          #    - likelihood
  posterior_probs <- posterior_probs/sum(posterior_probs)  # 2. normalize gamma(i)
  post_prob <- rbind(p, posterior_probs)
  return(post_prob)
}

posterior_mean <- function(posterior_probs){
  return(sum(posterior_probs[1,] * posterior_probs[2,]))
}
set.seed(1)
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max
K = 1000

# Later, we will use different prior to generate data versus do inference, 
# but for now, use the same (uniform on the K+1 grid points)
prior_used_for_computing_posterior <- rep(1/(K+1), K+1)

simulate_posterior_mean_error <- function(n_datapoints) {
  i <- rdunif(K + 1) - 1  # Always generate the data using a uniform prior
  true_p <- i/K
  x <- rbinom(1, n_datapoints, true_p) # Recall: a Binomial is just a sum of iid Bernoullis
  post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
  post_mean <- posterior_mean(post)
  return(abs(post_mean - true_p)) # Return the error by comparing posterior mean and p used to generate data
}

Accuracy of point estimate: empirical results

set.seed(1)
df <- data.frame("n_observations" = rep(c(10, 32, 100, 316, 1000), each=1000))
df$error_well_specified <- sapply(df$n_observations, simulate_posterior_mean_error)

ggplot(data=df, aes(x=n_observations, y=error_well_specified+1e-5)) +
  stat_summary(fun.y = mean, geom="line") + # Line averages over 1000 datasets
  scale_x_log10() +  # Show result in log-log scale
  scale_y_log10() +
  geom_point()

Poll: what is the slope in this graph

  1. 2
  2. 1/2
  3. 1
  4. -1/2
  5. -2

Accuracy of point estimate as a function of number of observations

\(\underbrace{\log_{10}(\text{error})}_{y} = a \underbrace{\log_{10}(\text{number of observation})}_{x} + b\)

we get \(\text{error} = O((\text{number of observations})^{-1/2})\)

What if we used the “wrong” prior

prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
plot(seq(0, 1, by=1/K), prior_used_for_computing_posterior)

set.seed(2)
df$error_mis_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
df <- gather(data=df, key=prior_type, value=error, error_mis_specified, error_well_specified)

ggplot(data=df, aes(x=n_observations, y=error+1e-5, colour=factor(prior_type))) +
  stat_summary(fun.y = mean, geom="line") +
  scale_x_log10() + 
  scale_y_log10() +
  geom_point()

First way to break consistency: violating Cromwell’s rule

Second way to break consistency: Unidentifiable models

model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}

drawing

Theoretical justification of consistency: Doob’s consistency theorem

Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.

Corollary: if you use the wrong prior, \(\tilde \pi_0\), but \(\tilde \pi_0(A) > 0 \Longleftrightarrow \pi_0(A) > 0\), then Doob’s consistency still hold for identifiable exchangeable models.

Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, you will get right anything that’s derived from it)