Bayes estimator: a first example
Alexandre Bouchard-Côté
Running example: insurance policy for a Delta 7925H

- You are consulting for a satellite operator
- They are about to send a $50M satellite on a Delta 7925H rocket
- Data: as of November 2020, Delta 7925H rockets have been launched 3 times, with 0 failed launches
- Note: the Delta 7925H is not reusable, so each rocket is a “copy” built from the same blueprint
- Should you recommend buying a $1M insurance policy?
Poll: Back of the envelope recommendation
- No, since estimated probability of failure is equal to 0/3 = 0
- No, since estimated probability of failure is between 0 and some constant
- No, since estimated probability of failure is between some constant and one
- Yes, since estimated probability of failure is smaller than some constant
- Yes, since estimated probability of failure is greater than some constant
Bayes estimator: first overview
The Bayes estimator,
argmin{E[L(a,Z)|X]:a∈A}
encodes a 3-step approach applicable to almost any statistical problem:
- Construct a probability model
- Compute or approximate the posterior distribution conditionally on the actual data at hand
- Solve an optimization problem to turn the posterior distribution into an "action"
Let us exemplify this process with the Delta rocket insurance problem.
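Before going step by step, here is a minimal end-to-end sketch of the three steps for the rocket example (a sketch, not from the slides: it uses plain rejection sampling in Python rather than a PPL, and anticipates the 1/50 decision threshold derived in step 3):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 1_000_000

# Step 1: probability model -- draw (p, x1..x3, x4) from the joint:
# p ~ Uniform(0, 1), and each launch is Bernoulli(p) given p.
p = rng.uniform(0, 1, K)
launches = rng.binomial(1, p[:, None], (K, 3))   # the data X = (X1, X2, X3)
next_launch = rng.binomial(1, p)                 # the unknown Z = X4

# Step 2: condition on the actual data (0, 0, 0) by rejection:
# keep only the joint draws whose simulated dataset matches ours.
match = launches.sum(axis=1) == 0
p_fail = next_launch[match].mean()  # approximates P(X4 = 1 | X = (0,0,0))

# Step 3: turn the posterior into an action: buy the $1M policy
# iff the predicted failure probability exceeds 1/50.
print(p_fail, p_fail > 1 / 50)  # approx 0.2, True
```

Rejection against the exact dataset works here only because the data space has 2^3 = 8 outcomes; realistic problems need MCMC, which is where PPLs come in.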
Step 1: Probability model
argmin{E[L(a,Z)|X]:a∈A}
A Bayesian model is a probability model equipped with a:
- random variable representing the data, X = (X_1, X_2, X_3) where
- X_i = 1 means rocket launch i was a failure
- random variable for the unknown quantities, Z
- in our example, this is just the fourth launch, Z = X_4
Concretely: a Bayesian model = a joint density γ(x,z)
Poll: What is the space of possible Bayesian models for the Delta rocket example?
- a single parameter, p∈[0,1]
- a discrete space of cardinality 2^4
- the space of lists of 2^4 positive real numbers p_i ∈ [0,1] such that ∑_{i=1}^{2^4} p_i = 1.
- all possible probability distributions
- none of the above
Choosing a Bayesian model
- The set of possible Bayesian models in the rocket example is equivalent to a list of 2^4 real numbers p_i ∈ [0,1], which sum to one (note: fixed bug, first version wrote 4 instead of 2^4)
- The Bayesian analyst has to pick ONE point in this set
- (Sharp contrast with frequentist models)
- How to pick one model:
- With practice! This is a bit of an art form.
- Then, use the scientific method (more on this when we talk about model selection)
Two techniques for building Bayesian models
Technique one: introducing nuisance variables
Augment the joint distribution with more unknown quantities, often called “parameters”.
Example:
- add to γ(x,z) one more component: in our example, a real number p∈[0,1],
- interpreted as a failure probability that will be shared by all launches.
- we now have to specify γ(x, z̃), where z̃ = (z, p); did that really help?
- from now on, I will use z ← z̃ to simplify notation.
Two techniques for building Bayesian models
Technique two: impose “regularities”
Symmetries, (conditional) independences, identical distributions, factorization, etc.
Example:
- impose a factorization: γ(x, z) = prior(p) ∏_{i=1}^{4} likelihood(x_i | p)
- further assume the following parametric forms:
- prior(p) = 1_{[0,1]}(p) (“Uniform”)
- likelihood(x_i | p) = p^{x_i} (1−p)^{1−x_i} (“Bernoulli”)
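Written out, the product of these pieces is the joint density (writing s = x_1 + x_2 + x_3 + x_4 for the number of failures among the four launches):

γ(x, z̃) = 1_{[0,1]}(p) ∏_{i=1}^{4} p^{x_i} (1−p)^{1−x_i} = 1_{[0,1]}(p) p^s (1−p)^{4−s}.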
Poll: Now that I have made these assumptions, do I have ONE probability model?
- No, p is still unknown
- No, x is still unknown
- Yes, by Bayes rule
- Yes, by the Law of total probability
- Yes, by chain rule
Prior and likelihood?
- All textbooks describe a Bayesian model as:
- a prior, prior(z), combined with
- a likelihood likelihood(x|z)
- I describe Bayesian models as a joint distribution γ(z,x)
- In simple cases, the two are equivalent
- By chain rule, γ(z,x)=prior(z)likelihood(x|z)
- But for more complicated models it can be awkward to separate what is the likelihood and what is the prior
Step 2: Conditioning
argmin{E[L(a,Z)|X]:a∈A}
Compute or approximate the posterior distribution conditionally on the actual data at hand.
- So far, the only aspect of the data that we have used is its datatype (3 binary variables x_1, x_2, x_3)
- We have not used the actual contents of our “dataset”: zero failures, i.e. (0, 0, 0)
- The actual data is introduced by conditioning
What you learned about conditioning in an undergraduate course
To compute E[f(Z) | X = x]:
- First, compute the posterior distribution using Bayes rule: p(z|x) = prior(z) likelihood(x|z) / ∫ prior(z′) likelihood(x|z′) dz′
- Then, compute the expectation under the posterior distribution: E[f(Z) | X = x] = ∫ f(z) p(z|x) dz
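In the rocket example this recipe can be carried out in closed form: with x = (0, 0, 0), the marginal posterior of p is p(p | x) ∝ 1_{[0,1]}(p) (1−p)^3, a Beta(1, 4) distribution, so for example

P(X_4 = 1 | X = (0,0,0)) = ∫_0^1 p · 4(1−p)^3 dp = 1/5.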
What you need to know about conditioning
Recall: Bayesians almost never use Bayes rule!
Instead:
- They specify a joint distribution γ(z,x) (often in a probabilistic programming language)
- They use the key equation: π(z)∝γ(z), where π(z)=p(z|x), γ(z)=γ(z,x)
- In words: the posterior distribution is proportional to the joint distribution (viewing both as functions of z only)
- This is all MCMC needs to know (true for the algorithm, but also for its designer, and its users)
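For instance, here is a sketch in Python (names are my own) of the unnormalized log posterior for the rocket model, with x held fixed at the observed (0, 0, 0); this single function is all an MCMC sampler needs to be given:

```python
import numpy as np

x_obs = np.array([0, 0, 0])  # the observed launches: zero failures

def log_gamma(z):
    """Unnormalized log posterior: log of γ(z, x) with x fixed at x_obs, z = (p, x4)."""
    p, x4 = z
    if not (0.0 < p < 1.0):       # outside the uniform prior's support
        return -np.inf
    xs = np.append(x_obs, x4)     # all four launches share the same p
    return np.sum(xs * np.log(p) + (1 - xs) * np.log(1 - p))
```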
Conditioning using a probabilistic programming language
Our model is:
- prior(p) = 1_{[0,1]}(p) (“Uniform prior”)
- likelihood(x_i | p) = p^{x_i} (1−p)^{1−x_i} (“Bernoulli likelihood”)
Increasingly, applied Bayesian statisticians use probabilistic programming languages (PPLs) to approximate conditional expectations.
Some of you may be familiar with the BUGS family of languages (WinBUGS, OpenBUGS, JAGS, Stan, Blang)
Example:

```
launch1 | failureProbability ~ Bernoulli(failureProbability)
launch2 | failureProbability ~ Bernoulli(failureProbability)
launch3 | failureProbability ~ Bernoulli(failureProbability)
nextLaunch | failureProbability ~ Bernoulli(failureProbability)
failureProbability ~ ContinuousUniform(0, 1)
```
- Assume for now that we can “magically” approximate E[f(Z)|X] for any function f(z), based on the output of a PPL
- Example: recall z = (p, x_4) and x = (x_1, x_2, x_3). We will soon need to compute P(X_4 = 1 | X).
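As a rough analogue in Python (a sketch using the PyMC PPL; the structure mirrors the Blang-style model above, variable names are mine):

```python
import pymc as pm

with pm.Model():
    failure_probability = pm.Uniform("failureProbability", 0, 1)
    # three observed launches (0 = success, 1 = failure), all successes
    pm.Bernoulli("launches", p=failure_probability, observed=[0, 0, 0])
    pm.Bernoulli("nextLaunch", p=failure_probability)
    trace = pm.sample()  # posterior samples of failureProbability and nextLaunch
```

E[f(Z)|X] is then approximated by averaging f over the posterior samples in trace; for example, the fraction of sampled nextLaunch values equal to 1 approximates P(X_4 = 1 | X).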
Poll: Which function f(z) should I use to cast P(X_4 = 1 | X) as E[f(Z)|X]?
- f(z)=z
- f(z)=1[z=1]
- f(z)=1
- f(z) = 1[x_4 = 1]
- None of the above
Conditioning using PPLs: theory
The PPLs we will use in this course are based on Monte Carlo methods.
The theoretical foundation of Monte Carlo methods is the law of large numbers (LLN), by which we mean any theorem providing a result of the form
(1/K) ∑_{k=1}^{K} f(Z_k) → ∫ f(z) π(z) dz
where:
- K is the number of iterations of the Monte Carlo method
- → is a suitable notion of convergence (e.g. almost sure)
- we make some assumptions on Z_1, Z_2, …
- in undergraduate courses: independent and identically distributed
- for Bayesian inference: we typically use a more advanced result, the law of large numbers for Markov chains (later in the course)
- in most Bayesian contexts:
- think of z as the latent (unobserved) states
- π(z) is the posterior (in the LLN for Markov chains, we design the Markov chain so that its stationary distribution is the posterior distribution)
- we assume f ∈ L^1(π), i.e. ∫ |f(z)| π(z) dz < ∞
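A quick numerical check of the LLN in the i.i.d. (undergraduate) setting, using the rocket posterior (a sketch; it relies on the Beta(1, 4) posterior computed earlier, so the draws Z_k = (p_k, x_{4,k}) are exact i.i.d. samples from π):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 100_000

# i.i.d. draws from the posterior π of the rocket model:
# p | x ~ Beta(1, 4), then x4 | p ~ Bernoulli(p).
p = rng.beta(1, 4, K)
x4 = rng.binomial(1, p)

# LLN: the Monte Carlo average of f(Z_k) with f(z) = 1[x4 = 1]
# converges to P(X4 = 1 | X) = 1/5.
print(np.mean(x4 == 1))  # close to 0.2
```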
Conditioning using PPLs: practical example
During the lecture, we went over steps 1 and 2 in this tutorial, walking you through the steps of approximating a posterior distribution using the Blang PPL (focusing on general principles which apply to most other mainstream PPLs, e.g. Stan, JAGS, pyMC).
- Connecting theory (last slide, LLN) and the practical example linked above…
- In step 1 of the tutorial, for pedagogical purposes, we take π to be the joint distribution
- In step 2 of the tutorial, we take π to be the posterior distribution
Step 3: take the Bayes action
argmin{E[L(a,Z)|X]:a∈A}
Solve an optimization problem to turn the posterior distribution into an "action"
The loss function in our example should encode the following information:
- Cost of satellite: $50M
- Cost of the insurance policy: $1M
Exercise: Design a loss function L that encodes the above information
Use the notation:
- Z = 1 means failure, Z = 0 means success
- set of possible actions: A={0,1},
- a=0: do not buy insurance,
- a=1: buy it
Example of loss function
Losses ($M):

| | Buy insurance (a=1) | Don't buy (a=0) |
| --- | --- | --- |
| Success (z=0) | 1 | 0 |
| Failure (z=1) | 1 | 50 |
Loss function: L(a, z) = 1[a=1] ($1M) + 1[a=0] 1[x_4=1] ($50M)
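In code, a minimal sketch (units are $M; the function name is mine):

```python
def loss(a: int, z: int) -> float:
    """L(a, z) in $M: a = 1 means buy the $1M policy; z = 1 means the launch fails."""
    return 1.0 * (a == 1) + 50.0 * (a == 0) * (z == 1)
```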
Next exercise: simplify the optimization problem in the Bayes estimator
argmin{E[L(a,Z)|X]:a∈A}
Example of a Bayes estimator
argmin{E[L(a,Z)|X] : a ∈ A}
= argmin{E[1[a=1] + 50 · 1[a=0] 1[X_4=1] | X] : a ∈ A}
= argmin{E[1[a=1] | X] + 50 E[1[a=0] 1[X_4=1] | X] : a ∈ A}
= argmin{1[a=1] + 50 · 1[a=0] P[X_4=1 | X] : a ∈ A}
= 1[p_fail(X) > 1/50],
where p_fail(X) = P[X_4=1 | X] encodes our predicted probability that the next launch will be a failure. (The last step compares the two actions directly: a = 1 incurs loss 1, while a = 0 incurs expected loss 50 p_fail(X), so buying is optimal exactly when p_fail(X) > 1/50.)
From earlier: “Assume for now that we can ‘magically’ approximate E[f(Z)|X] for any function f(z), based on the output of a PPL.”
So indeed, if we pick f(z) = 1[x_4=1], then P[X_4=1 | X] can be written as E[f(Z)|X], and hence approximated based on the output of a PPL.
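Putting the pieces together (a sketch; the value 0.2 is the posterior predictive probability P[X_4=1 | X] = 1/5 computed earlier, whether obtained in closed form or from a PPL):

```python
def bayes_action(p_fail: float) -> int:
    """Bayes estimator for the insurance problem: buy (a = 1) iff p_fail > 1/50."""
    return int(p_fail > 1 / 50)

print(bayes_action(0.2))  # 1: since 1/5 > 1/50, recommend buying the policy
```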
Go over step 3 in the first PPL tutorial.
Problem (based on a recent PhD defence in which I was the external examiner)
- Let δ_B denote the Bayes estimator, and δ_o, another estimator, based on a deep neural network.
- To compare the two estimators, the student generated a large number of datasets (Z_i, X_i) from the model
- For each dataset, compute a_i = δ_B(X_i), compute the loss L(a_i, Z_i), and score the Bayesian method with the average loss
- Do the same with the other estimator
- The student reported a lower loss for the other estimator, δ_o
Poll: Which of these statements is true?
- The deep neural network can perform better because it involves more parameters (over-parameterization)
- The deep neural network can perform better because of a combination of lucky initialization and noisy gradients
- The deep neural network can perform better because of some other reason
- This cannot happen unless the loss function is non-convex
- This cannot happen because of some other reason
Optimality of Bayesian estimators
- Correct answer is: “This cannot happen because of some other reason”
- We can state this because of another optimality property enjoyed by Bayes estimators
- Let us start by stating this optimality property
- How to encode mathematically the notion of average loss in the description below?
- Let δ_B denote the Bayes estimator, and δ_o, another estimator, based on a deep neural network.
- To compare the two estimators, the student generated a large number of datasets (Z_i, X_i) from the model
- For each dataset, compute a_i = δ_B(X_i), compute the loss L(a_i, Z_i), and score the Bayesian method with the average loss
Optimality of Bayesian estimators: theorem
Claim: E[L(δ_B(X), Z)] ≤ E[L(δ_o(X), Z)]
Proof: by the law of total expectation, E[L(δ_B(X), Z)] = E[ E[L(δ_B(X), Z) | X] ].
Now denote the objective being minimized in the Bayes estimator by O(a) = E[L(a, Z) | X], so that we can rewrite the above as:
E[ E[L(δ_B(X), Z) | X] ] = E[ O(argmin_a O(a)) ]
Now note that for all a′, O(argmin_a O(a)) ≤ O(a′). Applying this with a′ = δ_o(X) and taking expectations:
E[L(δ_B(X), Z)] = E[ O(argmin_a O(a)) ] ≤ E[ O(δ_o(X)) ] = E[L(δ_o(X), Z)].
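To see the theorem in action, here is a sketch (not from the slides) of the student's experimental protocol applied to the rocket model, with a hypothetical plug-in rule standing in for δ_o (the slides' δ_o was a neural network): the plug-in rule buys insurance iff the empirical failure rate exceeds 1/50.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Generate datasets (Z_i, X_i) from the model: p ~ Uniform(0,1), launches ~ Bernoulli(p)
p = rng.uniform(0, 1, N)
X = rng.binomial(1, p[:, None], (N, 3))
Z = rng.binomial(1, p)
k = X.sum(axis=1)  # number of failures observed in each dataset

def loss(a, z):  # in $M, as defined earlier
    return 1.0 * (a == 1) + 50.0 * (a == 0) * (z == 1)

# Bayes estimator: with a uniform prior, P(X4 = 1 | X) = (k + 1) / 5,
# so buy insurance iff (k + 1) / 5 > 1 / 50 (here, always true).
bayes_a = ((k + 1) / 5 > 1 / 50).astype(int)

# Hypothetical plug-in alternative: buy iff the empirical failure rate k/3 > 1/50.
other_a = (k / 3 > 1 / 50).astype(int)

print(loss(bayes_a, Z).mean())  # approx 1.0
print(loss(other_a, Z).mean())  # approx 3.25 -- larger, as the theorem guarantees
```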
Discussion: Why do you think the student observed a lower expected loss for δ_o in their experiment?