Bayes estimator: a first example
Alexandre Bouchard-Côté
Running example: insurance policy for a Delta 7925H

- You are consulting for a satellite operator
- They are about to send a $50M satellite on a Delta 7925H rocket
- Data: as of November 2020, Delta 7925H rockets have been launched 3 times, with 0 failed launches
- Note: Delta 7925H is not reusable, so each rocket is a “copy” built from the same blueprint
- Should you recommend buying a $1M insurance policy?
Poll: Back of the envelope recommendation
- No, since estimated probability of failure is equal to \(0 = 0/3\)
- No, since estimated probability of failure is between 0 and some constant
- No, since estimated probability of failure is between some constant and one
- Yes, since estimated probability of failure is smaller than some constant
- Yes, since estimated probability of failure is greater than some constant
Bayes estimator: first overview
The Bayes estimator,
\[\color{blue}{{\textrm{argmin}}} \{ \color{red}{{\mathbf{E}}}[\color{blue}{L}(a, Z) \color{green}{| X}] : a \in {\mathcal{A}}\}\]
encodes a 3-step approach applicable to almost any statistical problem:
- \(\color{red}{\text{Construct a probability model}}\)
- \(\color{green}{\text{Compute or approximate the posterior distribution conditionally on the actual data at hand}}\)
- \(\color{blue}{\text{Solve an optimization problem to turn the posterior distribution into an "action"}}\)
Let us exemplify this process with the Delta rocket insurance problem.
Step 1: Probability model
\[{\textrm{argmin}}\{ \color{red}{{\mathbf{E}}}[L(a, Z) | X] : a \in {\mathcal{A}}\}\]
A Bayesian model is a probability model equipped with a:
- random variable representing the data, \(X = (X_1, X_2, X_3)\) where
- \(X_i = 1\) means rocket launch \(i\) was a failure
- random variable for the unknown quantities, \(Z\)
- in our example, this is just the fourth launch \(Z = X_4\)
Concretely: a Bayesian model = a joint density \(\gamma(x, z)\)
Poll: What is the space of possible Bayesian models for the Delta rocket example?
- a single parameter, \(p \in [0, 1]\)
- a discrete space of cardinality \(2^4\)
- a list of \(2^4\) real numbers \(p_i \in [0, 1]\), which sum to one
- all possible probability distributions
- none of the above
Choosing a Bayesian model
- The set of possible Bayesian models in the rocket example is equivalent to a list of \(2^4\) real numbers \(p_i \in [0, 1]\), which sum to one
- The Bayesian analyst has to pick ONE point in this set
- (Sharp contrast with frequentist models)
- How to pick one model:
- With practice! This is a bit of an art form.
- Then, use the scientific method (more on this when we talk about model selection)
Two techniques for building Bayesian models
Technique one: introducing nuisance variables
Augment the joint distribution with more unknown quantities, often called “parameters”.
Example:
- add to \(\gamma(x, z)\) one more component: in our example, a real number \(p \in [0, 1]\),
- interpreted as a failure probability that will be shared by all launches.
- we now have to specify \(\gamma(x, \tilde z)\), where \(\tilde z = (z, p)\); did that really help?
- from now on, I will use \(z \gets \tilde z\) to simplify notation.
Two techniques for building Bayesian models
Technique two: impose “regularities”
Symmetries, (conditional) independences, identical distributions, factorization, etc.
Example:
- impose a factorization, \(\gamma(x, z) = {\text{prior}}(p) \prod_{i=1}^4 {\text{likelihood}}(x_i | p)\)
- further assume the following parametric forms:
- \({\text{prior}}(p) = {{\bf 1}}_{[0,1]}(p)\) (“Uniform”)
- \({\text{likelihood}}(x_i | p) = p^{x_i} (1-p)^{1-x_i}\) (“Bernoulli”)
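As a sanity check, the joint density these two assumptions define can be written out in a few lines of Python (a hypothetical sketch; the function name and argument convention are mine):

```python
def joint_density(x, p):
    """Joint density gamma(x, z) with z = (x4, p): a Uniform(0, 1) prior
    on p times a Bernoulli(p) likelihood for each launch outcome in x."""
    if not 0.0 <= p <= 1.0:
        return 0.0  # the prior is the indicator function of [0, 1]
    density = 1.0   # prior(p) = 1 on [0, 1]
    for xi in x:
        density *= p**xi * (1.0 - p)**(1 - xi)  # Bernoulli factor
    return density

# Three observed launches (0, 0, 0) plus a hypothetical failed fourth launch:
print(joint_density((0, 0, 0, 1), 0.5))  # 0.5**3 * 0.5 = 0.0625
```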
Poll: Now that I have made these assumptions, do I have ONE probability model?
- No, \(p\) is still unknown
- No, \(x\) is still unknown
- Yes, by Bayes rule
- Yes, by the Law of total probability
- Yes, by chain rule
Prior and likelihood?
- All textbooks describe a Bayesian model as:
- a prior, \({\text{prior}}(z)\), combined with
- a likelihood \(\text{likelihood}(x|z)\)
- I describe Bayesian models as a joint distribution \(\gamma(z, x)\)
- In simple cases, the two are equivalent
- By chain rule, \(\gamma(z, x) = \text{prior}(z) \text{likelihood}(x | z)\)
- But for more complicated models it can be awkward to separate what is the likelihood and what is the prior
Step 2: Conditioning
\[{\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) \color{green}{| X}] : a \in {\mathcal{A}}\}\] \(\color{green}{\text{Compute or approximate the posterior distribution conditionally on the actual data at hand}}\)
- So far, the only aspect of the data that we have used is its datatype (3 binary variables \(x_1, x_2, x_3\))
- We have not used the actual contents of our “dataset”: zero failures, i.e. (0, 0, 0)
- The actual data is introduced by conditioning
What you learned about conditioning in an undergraduate course
To compute \({\mathbf{E}}[f(Z) | X]|_{x}\):
- First compute the posterior distribution using Bayes rule \[p(z|x) = \frac{{\text{prior}}(z) {\text{likelihood}}(x|z)}{\int {\text{prior}}(z') {\text{likelihood}}(x|z') {\text{d}}z'}\]
- Then, compute the expectation under the posterior distribution \[{\mathbf{E}}[f(Z) | X]|_{x} = \int f(z) p(z | x) {\text{d}}z\]
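For our rocket model, this recipe can be carried out numerically on a grid (an illustrative sketch, not what PPLs actually do; the variable names are mine):

```python
import numpy as np

# Grid approximation of the undergraduate recipe for the rocket model:
# prior(p) = 1 on [0, 1], likelihood(x | p) = prod_i p^{x_i} (1 - p)^{1 - x_i}.
grid = np.linspace(0.0, 1.0, 100_001)
dp = grid[1] - grid[0]
x = (0, 0, 0)  # three launches, zero failures

likelihood = np.ones_like(grid)
for xi in x:
    likelihood *= grid**xi * (1.0 - grid)**(1 - xi)

unnormalized = 1.0 * likelihood                       # prior(p) * likelihood(x | p)
posterior = unnormalized / (unnormalized.sum() * dp)  # Bayes rule, denominator by Riemann sum

# E[f(Z) | X = x] for f(p) = p, i.e. the posterior mean of the failure probability
posterior_mean = (grid * posterior).sum() * dp
print(round(posterior_mean, 3))  # 0.2, the mean of a Beta(1, 4) distribution
```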
What you need to know about conditioning
Recall: Bayesians almost never use Bayes rule!
Instead:
- They specify a joint distribution \(\gamma(z, x)\) (often in a probabilistic programming language)
- They use the key equation: \(\pi(z) \propto \gamma(z)\), where \(\pi(z) = p(z|x)\), \(\gamma(z) = \gamma(z, x)\)
- In words: the posterior distribution is proportional to the joint distribution (viewing both as functions of \(z\) only)
- This is all MCMC needs to know (true for the algorithm, but also for its designer, and its users)
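To see that the key equation really is all MCMC needs, here is a minimal random-walk Metropolis sampler (a hypothetical sketch, not the course's implementation) that only ever evaluates the unnormalized \(\gamma\):

```python
import random

def gamma_unnormalized(p):
    """Joint gamma(z, x) at the observed data x = (0, 0, 0), viewed as a
    function of z = p only: uniform prior times (1 - p)^3."""
    return (1.0 - p)**3 if 0.0 <= p <= 1.0 else 0.0

random.seed(1)
p, samples = 0.5, []
for _ in range(200_000):
    proposal = p + random.gauss(0.0, 0.25)  # symmetric random-walk proposal
    # Accept with probability min(1, gamma(proposal) / gamma(p));
    # the unknown normalizing constant cancels in this ratio:
    if random.random() < gamma_unnormalized(proposal) / gamma_unnormalized(p):
        p = proposal
    samples.append(p)

# By the LLN for Markov chains, this average converges to E[p | X] = 1/5
print(sum(samples) / len(samples))
```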
Conditioning using a probabilistic programming language
Our model is:
- \({\text{prior}}(p) = {{\bf 1}}_{[0,1]}(p)\) (“Uniform prior”)
- \({\text{likelihood}}(x_i | p) = p^{x_i} (1-p)^{1-x_i}\) (“Bernoulli likelihood”)
Increasingly, applied Bayesian statisticians use probabilistic programming languages (PPLs) to approximate conditional expectations
Some of you may be familiar with the BUGS family of languages (WinBUGS, OpenBUGS, JAGS, Stan, Blang)
Example:
launch1 | failureProbability ~ Bernoulli(failureProbability)
launch2 | failureProbability ~ Bernoulli(failureProbability)
launch3 | failureProbability ~ Bernoulli(failureProbability)
nextLaunch | failureProbability ~ Bernoulli(failureProbability)
failureProbability ~ ContinuousUniform(0, 1)
- Assume for now that we can “magically” approximate \({\mathbf{E}}[f(Z)|X]\) for any function \(f(z)\), based on the output of a PPL
- Example: recall \(z = (p, x_4)\) and \(x = (x_1, x_2, x_3)\). Soon, we will see that we will need to compute \({\mathbb{P}}(X_4 = 1 | X)\).
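One way to demystify the “magic”: forward-simulate the model and keep only the simulations that match the observed data, i.e. rejection sampling (a toy Python sketch of the Blang model above, not how mainstream PPLs actually condition; names are mine):

```python
import random

random.seed(0)
kept_next_launch = []
for _ in range(400_000):
    # Forward-simulate the full model:
    failure_prob = random.random()  # failureProbability ~ ContinuousUniform(0, 1)
    launches = [int(random.random() < failure_prob) for _ in range(3)]
    next_launch = int(random.random() < failure_prob)
    # Condition on the observed data (0, 0, 0) by rejection:
    if launches == [0, 0, 0]:
        kept_next_launch.append(next_launch)

# Approximates P(X4 = 1 | X = (0, 0, 0)), which equals 1/5 for this model
print(sum(kept_next_launch) / len(kept_next_launch))
```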
Poll: Which function \(f(z)\) should I use to cast \({\mathbb{P}}(X_4 = 1 | X)\) as \({\mathbf{E}}[f(Z)|X]\)?
- \(f(z) = z\)
- \(f(z) = {{\bf 1}}[z = 1]\)
- \(f(z) = 1\)
- \(f(z) = {{\bf 1}}[x_4 = 1]\)
- None of the above
Conditioning using PPLs: theory
The PPLs we will use in this course are based on Monte Carlo methods.
The theoretical foundation of Monte Carlo methods is the law of large numbers (LLN): any theorem providing a result of the form
\[\frac{1}{K} \sum_{k=1}^K f(Z_k) \to \int f(z) \pi(z) {\text{d}}z\]
where:
- \(K\) is the number of iterations of the Monte Carlo method
- \(\to\) is a suitable notion of convergence (e.g. almost sure)
- we make some assumptions on \(Z_1, Z_2, \dots\),
- in undergraduate course: independent and identically distributed
- for Bayesian inference: we typically use a more advanced result, the law of large numbers for Markov chains (later in the course)
- in most Bayesian contexts:
- think of \(z\) as the latent (unobserved) states
- \(\pi(z)\) is the posterior (in the LLN for Markov chains, we design the Markov chain so that its stationary distribution is the posterior distribution)
- we assume \(f \in L_1(\pi)\), i.e. \(\int |f(z)| \pi(z) {\text{d}}z < \infty\)
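The iid version of the LLN can be illustrated directly: for the rocket model one can check that the posterior \(\pi\) is a Beta(1, 4) distribution, so averaging iid draws with \(f(z) = z\) should approach \(1/5\) (a sketch; seed and sample size are arbitrary):

```python
import random

random.seed(2)
K = 100_000
# In our example the posterior pi is Beta(1, 4); betavariate draws iid samples.
draws = [random.betavariate(1, 4) for _ in range(K)]
estimate = sum(draws) / K  # (1/K) sum_k f(Z_k) with f(z) = z

# LLN: converges to the integral of z * pi(z) dz = 1/5 as K grows
print(estimate)
```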
Conditioning using PPLs: practical example
During the lecture, we went over steps 1 and 2 in this tutorial, walking you through the steps of approximating a posterior distribution using the Blang PPL (focusing on general principles which apply to most other mainstream PPLs, e.g. Stan, JAGS, PyMC). Complete the rest as a short exercise set.
- Connecting theory (last slide, LLN) and the practical example linked above…
- How do the heights of the plots relate to the LLN?
- In step 1 of the tutorial, for pedagogical purposes, we take \(\pi\) to be the joint distribution
- In step 2 of the tutorial, we take \(\pi\) to be the posterior distribution
Step 3: take the Bayes action
\[\color{blue}{{\textrm{argmin}}} \{ {\mathbf{E}}[\color{blue}{L}(a, Z) | X] : a \in {\mathcal{A}}\}\]
\(\color{blue}{\text{Solve an optimization problem to turn the posterior distribution into an "action"}}\)
The loss function in our example should encode the following information:
- Cost of satellite: $50M
- Cost of the insurance policy $1M
Exercise: Design a loss function \(L\) that encodes the above information
Use the notation:
- \(Z = 1\) means failure, \(Z = 0\), success
- set of possible actions: \({\mathcal{A}}= \{0, 1\}\),
- \(a = 0\): do not buy insurance,
- \(a = 1\): buy it
Example of loss function
Losses ($M):
|                   | Buy \((a=1)\) | Do not buy \((a=0)\) |
|-------------------|---------------|----------------------|
| Success \((z=0)\) | 1             | 0                    |
| Failure \((z=1)\) | 1             | 50                   |
Loss function: \(L(a, z) = {{\bf 1}}[a = 1] (\$1M) + {{\bf 1}}[a = 0] {{\bf 1}}[z = 1] (\$50M)\)
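The loss function and table translate directly into code (a small sketch; the encoding of actions and outcomes follows the notation above):

```python
def loss(a, z):
    """Loss in $M: a = 1 means buy insurance, z = 1 means launch failure."""
    insurance_cost = 1 if a == 1 else 0           # 1[a = 1] * $1M
    satellite_loss = 50 if (a == 0 and z == 1) else 0  # 1[a = 0] 1[z = 1] * $50M
    return insurance_cost + satellite_loss

# Reproduces the table row by row:
assert loss(1, 0) == 1 and loss(0, 0) == 0   # success
assert loss(1, 1) == 1 and loss(0, 1) == 50  # failure
```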
Next exercise: simplify the optimization problem in the Bayes estimator
\[\color{blue}{{\textrm{argmin}}} \{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\}\]
Example of a Bayes estimator
\[
\begin{align*}
{\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} &= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[a = 1] + 50 {{\bf 1}}[a = 0] {{\bf 1}}[X_4 = 1] | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[a = 1] | X] + 50 {\mathbf{E}}[ {{\bf 1}}[a = 0] {{\bf 1}}[X_4 = 1] | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {{\bf 1}}[a = 1] + 50 {{\bf 1}}[a = 0] {\mathbb{P}}[X_4 = 1 | X] : a \in {\mathcal{A}}\} \\
&= {{\bf 1}}[p_\text{fail}(X) > 1/50],
\end{align*}
\]
where \(p_\text{fail}(X) = {\mathbb{P}}[X_4 = 1 | X]\) encodes our predicted probability that the next launch will be a failure.
From earlier: “Assume for now that we can ‘magically’ approximate \({\mathbf{E}}[f(Z)|X]\) for any function \(f(z)\), based on the output of a PPL.”
So indeed if we pick \(f(z) = {{\bf 1}}[x_4=1]\), then \({\mathbb{P}}[X_4 = 1|X]\) can be written as \({\mathbf{E}}[f(Z)|X]\), and hence approximated based on the output of a PPL.
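Putting step 3 together (a sketch; the function name is mine): the Bayes action simply compares the two posterior expected losses.

```python
def bayes_action(p_fail):
    """Minimize 1[a = 1] + 50 * 1[a = 0] * p_fail over a in {0, 1}."""
    expected_loss = {0: 50 * p_fail,  # do not buy: risk $50M with prob p_fail
                     1: 1.0}          # buy: pay $1M for sure
    return min(expected_loss, key=expected_loss.get)

# With p_fail = 1/5 (the posterior predictive from our model), buy insurance:
print(bayes_action(0.2))  # 1, since 0.2 > 1/50
```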
Go over step 3 in the first PPL tutorial.