Alexandre Bouchard-Côté
$$\mathrm{SD}(\delta) = \sqrt{\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)} = \sqrt{\frac{n\operatorname{Var}(X_1)}{n^2}} = \frac{\text{constant}}{\sqrt{n}}$$
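A minimal Monte Carlo sketch of this scaling (in R; the Bernoulli(0.5) example and the helper name sd_of_mean are illustrative, not part of the slides):

set.seed(1)
sd_of_mean <- function(n, n_reps = 10000) {
  # standard deviation of the sample mean of n iid Bernoulli(0.5) draws,
  # estimated over n_reps replications
  sd(replicate(n_reps, mean(rbinom(n, 1, 0.5))))
}
ns <- c(10, 100, 1000)
sapply(ns, sd_of_mean) * sqrt(ns)  # roughly constant (about 0.5), as the formula predicts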
posterior_distribution <- function(prior_probs, n_successes, n_trials) {
  K <- length(prior_probs) - 1    # K+1 values that your p can assume
  n_fails <- n_trials - n_successes
  p <- seq(0, 1, 1/K)
  posterior_probs <-              # 1. this computes gamma(i)
    prior_probs *                 #    - prior
    p^n_successes * (1-p)^n_fails #    - likelihood
  posterior_probs <- posterior_probs / sum(posterior_probs) # 2. normalize gamma(i)
  post_prob <- rbind(p, posterior_probs)
  return(post_prob)
}
posterior_mean <- function(posterior_probs) {
  return(sum(posterior_probs[1, ] * posterior_probs[2, ]))
}
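As a quick sanity check of the two functions above (the grid size and data below are made up for illustration):

grid_prior <- rep(1/11, 11)                       # uniform prior on a grid of K+1 = 11 points
post <- posterior_distribution(grid_prior, 3, 10) # 3 successes out of 10 trials
posterior_mean(post)                              # about 1/3, close to the Laplace rule of succession (3+1)/(10+2)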
set.seed(1)
rdunif <- function(max) { return(ceiling(max * runif(1))) } # uniform on 1, 2, ..., max
K <- 1000
# Later, we will use a different prior to generate data versus to do inference,
# but for now, use the same one (uniform on the K+1 grid points)
prior_used_for_computing_posterior <- rep(1/(K+1), K+1)

simulate_posterior_mean_error <- function(n_datapoints) {
  i <- rdunif(K + 1) - 1               # always generate the data using a uniform prior
  true_p <- i/K
  x <- rbinom(1, n_datapoints, true_p) # recall: a Binomial is just a sum of iid Bernoullis
  post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
  post_mean <- posterior_mean(post)
  return(abs(post_mean - true_p))      # return the error: |posterior mean - p used to generate the data|
}
library(ggplot2)  # used for the plots below

set.seed(1)
df <- data.frame("n_observations" = rep(c(10, 32, 100, 316, 1000), each = 1000))
df$error_well_specified <- sapply(df$n_observations, simulate_posterior_mean_error)

ggplot(data = df, aes(x = n_observations, y = error_well_specified + 1e-5)) +
  stat_summary(fun.y = mean, geom = "line") + # line averages over the 1000 datasets per n
  scale_x_log10() +                           # show result in log-log scale
  scale_y_log10() +
  geom_point()
Result suggests a linear fit in the log-log scale:
$$\underbrace{\log_{10}(\text{error})}_{y} = a\,\underbrace{\log_{10}(\text{number of observations})}_{x} + b$$
Eyeball the slope $a = \Delta y / \Delta x$ from the plot (see the regression sketch below)
Get $a \approx -1/2$
Hence, taking powers of 10 on each side of
$$\underbrace{\log_{10}(\text{error})}_{y} = a\,\underbrace{\log_{10}(\text{number of observations})}_{x} + b$$
we get $\text{error} = O\big((\text{number of observations})^{-1/2}\big)$
Similarly to the frequentist setting, for many models, the root mean squared error will scale like $\dfrac{\text{constant}}{\sqrt{\text{number of observations}}}$
To gain 2 more significant digits, i.e. to divide the error by 100, we need $100^2 = 10{,}000\times$ more observations, etc.
What we did here is an example of an (empirical) asymptotic analysis
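Instead of eyeballing, a least-squares fit on the log-log scale gives the slope directly (a sketch, assuming the data frame df built above, run before it is reshaped in the mis-specified-prior experiment below):

# average the error over the 1000 replicate datasets for each value of n, as the plotted line does
avg <- aggregate(error_well_specified ~ n_observations, data = df, FUN = mean)
fit <- lm(log10(error_well_specified) ~ log10(n_observations), data = avg)
coef(fit)[2]  # estimated slope a, expected to be close to -1/2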
# Mis-specified prior: a discretized Normal(0.2, 0.2), while the data are still generated from the uniform prior
prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K), mean = 0.2, sd = 0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)
plot(seq(0, 1, by = 1/K), prior_used_for_computing_posterior)
library(tidyr)  # used for gather below

set.seed(2)
df$error_mis_specified <- sapply(df$n_observations, simulate_posterior_mean_error)
df <- gather(data = df, key = prior_type, value = error, error_mis_specified, error_well_specified)

ggplot(data = df, aes(x = n_observations, y = error + 1e-5, colour = factor(prior_type))) +
  stat_summary(fun.y = mean, geom = "line") +
  scale_x_log10() +
  scale_y_log10() +
  geom_point()
Suppose we put zero prior mass on all $p$ smaller than $1/2$. What happens to the posterior?
If we put prior mass of zero on some event, say $\pi_0(A) = 0$, then the posterior probability of that event will always be zero, $\pi_n(A) = 0$, no matter how many observations $n$ we get
Cromwell’s rule (Oliver Cromwell, 1650): “I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”
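A small illustration of this with the grid model above (the truncated prior and the data values are illustrative):

truncated_prior <- rep(1, K + 1)
truncated_prior[seq(0, 1, 1/K) < 0.5] <- 0   # zero prior mass on all p < 1/2
truncated_prior <- truncated_prior / sum(truncated_prior)
post <- posterior_distribution(truncated_prior, n_successes = 10, n_trials = 100)  # data look like p ≈ 0.1
sum(post[2, post[1, ] < 0.5])                # posterior mass below 1/2 is exactly 0, whatever the data say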
When there are too many parameters, you may not be able to recover them even if you had “infinite data”
Model code in the Blang PPL
model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p1, p2, nLaunches ~ Binomial(nLaunches, p1 * p2)
  }
}
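A sketch of why this model is unidentifiable, in R rather than Blang (the numbers are illustrative): the likelihood depends on (p1, p2) only through the product p1 * p2, so parameter pairs with the same product can never be told apart by the data.

n_launches <- 1000
n_fails <- 300
dbinom(n_fails, n_launches, 0.5 * 0.6)  # likelihood at p1 = 0.5, p2 = 0.6
dbinom(n_fails, n_launches, 0.3 * 1.0)  # same product p1 * p2 = 0.3, hence exactly the same likelihood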
In the exchangeable setup, outside of these two failure modes, we are guaranteed to have consistency. The precise statement of the theorem is somewhat delicate, and its conditions need to be kept in mind to interpret the result correctly.
Theorem (Doob’s consistency): if an exchangeable model is identifiable, Bayesian consistency holds for this model.
Corollary: if you use the wrong prior, $\tilde{\pi}_0$, but $\tilde{\pi}_0(A) > 0 \iff \pi_0(A) > 0$, then Doob's consistency still holds for identifiable exchangeable models.
Note: the same holds true for any point estimate based on the posterior distribution (if you get the posterior right, anything derived from it will be right too)