Bayes estimator: a first example
Alexandre Bouchard-Côté
Running example: insurance policy for a Delta 7925H

- You are consulting for a satellite operator
- They are about to send a $50M satellite on a Delta 7925H rocket
- Data: as of November 2020, Delta 7925H rockets have been launched 3 times, with 0 failed launches
- Note: the Delta 7925H is not reusable, so each rocket is a “copy” built from the same blueprint
- Should you recommend buying a $1M insurance policy?
Poll: Back of the envelope recommendation
- No, since estimated probability of failure is equal to 0/3 = 0
- No, since estimated probability of failure is between 0 and some constant
- No, since estimated probability of failure is between some constant and one
- Yes, since estimated probability of failure is smaller than some constant
- Yes, since estimated probability of failure is greater than some constant
Bayes estimator: first overview
The Bayes estimator,
argmin{E[L(a,Z)|X]:a∈A}
encodes a 3-step approach applicable to almost any statistical problem:
- Construct a probability model
- Compute or approximate the posterior distribution conditionally on the actual data at hand
- Solve an optimization problem to turn the posterior distribution into an "action"
Let us exemplify this process with the Delta rocket insurance problem.
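Before going step by step, here is a minimal end-to-end sketch of the three steps for the rocket example (a sketch, not from the slides: it uses plain rejection sampling in Python rather than a PPL, and anticipates the 1/50 decision threshold derived in step 3):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 1_000_000

# Step 1: probability model -- draw (p, x1..x3, x4) from the joint:
# p ~ Uniform(0, 1), and each launch is Bernoulli(p) given p.
p = rng.uniform(0, 1, K)
launches = rng.binomial(1, p[:, None], (K, 3))   # the data X = (X1, X2, X3)
next_launch = rng.binomial(1, p)                 # the unknown Z = X4

# Step 2: condition on the actual data (0, 0, 0) by rejection:
# keep only the joint draws whose simulated dataset matches ours.
match = launches.sum(axis=1) == 0
p_fail = next_launch[match].mean()  # approximates P(X4 = 1 | X = (0,0,0))

# Step 3: turn the posterior into an action: buy the $1M policy
# iff the predicted failure probability exceeds 1/50.
print(p_fail, p_fail > 1 / 50)  # approx 0.2, True
```

Rejection against the exact dataset works here only because the data space has 2^3 = 8 outcomes; realistic problems need MCMC, which is where PPLs come in.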
Step 1: Probability model
argmin{E[L(a,Z)|X]:a∈A}
A Bayesian model is a probability model equipped with a:
- random variable representing the data, X = (X_1, X_2, X_3) where
- X_i = 1 means rocket launch i was a failure
- random variable for the unknown quantities, Z
- in our example, this is just the fourth launch, Z = X_4
Concretely: a Bayesian model = a joint density γ(x,z)
Poll: What is the space of possible Bayesian models for the Delta rocket example?
- a single parameter, p∈[0,1]
- a discrete space of cardinality 2^4
- the space of lists of 2^4 positive real numbers p_i ∈ [0,1] such that ∑_{i=1}^{2^4} p_i = 1.
- all possible probability distributions
- none of the above
Choosing a Bayesian model
- The set of possible Bayesian models in the rocket example is equivalent to a list of 2^4 real numbers p_i ∈ [0,1], which sum to one (note: fixed bug, first version wrote 4 instead of 2^4)
- The Bayesian analyst has to pick ONE point in this set
- (Sharp contrast with frequentist models)
- How to pick one model:
- With practice! This is a bit of an art form.
- Then, use the scientific method (more on this when we talk about model selection)
Two techniques for building Bayesian models
Technique one: introducing nuisance variables
Augment the joint distribution with more unknown quantities, often called “parameters”.
Example:
- add to γ(x,z) one more component: in our example, a real number p∈[0,1],
- interpreted as a failure probability that will be shared by all launches.
- we now have to specify γ(x, z̃), where z̃ = (z, p); did that really help?
- from now on, I will use z ← z̃ to simplify notation.
Two techniques for building Bayesian models
Technique two: impose “regularities”
Symmetries, (conditional) independences, identical distributions, factorization, etc.
Example:
- impose a factorization: γ(x, z) = prior(p) ∏_{i=1}^{4} likelihood(x_i | p)
- further assume the following parametric forms:
- prior(p) = 1_{[0,1]}(p) (“Uniform”)
- likelihood(x_i | p) = p^{x_i} (1−p)^{1−x_i} (“Bernoulli”)
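Written out, the product of these pieces is the joint density (writing s = x_1 + x_2 + x_3 + x_4 for the number of failures among the four launches):

γ(x, z̃) = 1_{[0,1]}(p) ∏_{i=1}^{4} p^{x_i} (1−p)^{1−x_i} = 1_{[0,1]}(p) p^s (1−p)^{4−s}.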
Poll: Now that I have made these assumptions, do I have ONE probability model?
- No, p is still unknown
- No, x is still unknown
- Yes, by Bayes rule
- Yes, by the Law of total probability
- Yes, by chain rule
Prior and likelihood?
- All textbooks describe a Bayesian model as:
- a prior, prior(z), combined with
- a likelihood likelihood(x|z)
- I describe Bayesian models as a joint distribution γ(z,x)
- In simple cases, the two are equivalent
- By chain rule, γ(z,x)=prior(z)likelihood(x|z)
- But for more complicated models it can be awkward to separate what is the likelihood and what is the prior
Step 2: Conditioning
argmin{E[L(a,Z)|X]:a∈A}
Compute or approximate the posterior distribution conditionally on the actual data at hand.
- So far, the only aspect of the data that we have used is its datatype (3 binary variables x_1, x_2, x_3)
- We have not used the actual contents of our “dataset”: zero failures, i.e. (0, 0, 0)
- The actual data is introduced by conditioning
What you learned about conditioning in an undergraduate course
To compute E[f(Z) | X = x]:
- First, compute the posterior distribution using Bayes rule: p(z|x) = prior(z) likelihood(x|z) / ∫ prior(z′) likelihood(x|z′) dz′
- Then, compute the expectation under the posterior distribution: E[f(Z) | X = x] = ∫ f(z) p(z|x) dz
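In the rocket example this recipe can be carried out in closed form: with x = (0, 0, 0), the marginal posterior of p is p(p | x) ∝ 1_{[0,1]}(p) (1−p)^3, a Beta(1, 4) distribution, so for example

P(X_4 = 1 | X = (0,0,0)) = ∫_0^1 p · 4(1−p)^3 dp = 1/5.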
What you need to know about conditioning
Recall: Bayesians almost never use Bayes rule!
Instead:
- They specify a joint distribution γ(z,x) (often in a probabilistic programming language)
- They use the key equation: π(z)∝γ(z), where π(z)=p(z|x), γ(z)=γ(z,x)
- In words: the posterior distribution is proportional to the joint distribution (viewing both as functions of z only)
- This is all MCMC needs to know (true for the algorithm, but also for its designer, and its users)
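For instance, here is a sketch in Python (names are my own) of the unnormalized log posterior for the rocket model, with x held fixed at the observed (0, 0, 0); this single function is all an MCMC sampler needs to be given:

```python
import numpy as np

x_obs = np.array([0, 0, 0])  # the observed launches: zero failures

def log_gamma(z):
    """Unnormalized log posterior: log of γ(z, x) with x fixed at x_obs, z = (p, x4)."""
    p, x4 = z
    if not (0.0 < p < 1.0):       # outside the uniform prior's support
        return -np.inf
    xs = np.append(x_obs, x4)     # all four launches share the same p
    return np.sum(xs * np.log(p) + (1 - xs) * np.log(1 - p))
```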
Conditioning using a probabilistic programming language
Our model is:
- prior(p) = 1_{[0,1]}(p) (“Uniform prior”)
- likelihood(x_i | p) = p^{x_i} (1−p)^{1−x_i} (“Bernoulli likelihood”)
Increasingly, applied Bayesian statisticians use probabilistic programming languages (PPLs) to approximate conditional expectations.
Some of you may be familiar with the BUGS family of languages (WinBUGS, OpenBUGS, JAGS, Stan, Blang)
Example:

```
launch1 | failureProbability ~ Bernoulli(failureProbability)
launch2 | failureProbability ~ Bernoulli(failureProbability)
launch3 | failureProbability ~ Bernoulli(failureProbability)
nextLaunch | failureProbability ~ Bernoulli(failureProbability)
failureProbability ~ ContinuousUniform(0, 1)
```
- Assume for now that we can “magically” approximate E[f(Z)|X] for any function f(z), based on the output of a PPL
- Example: recall z = (p, x_4) and x = (x_1, x_2, x_3). We will soon need to compute P(X_4 = 1 | X).
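As a rough analogue in Python (a sketch using the PyMC PPL; the structure mirrors the Blang-style model above, variable names are mine):

```python
import pymc as pm

with pm.Model():
    failure_probability = pm.Uniform("failureProbability", 0, 1)
    # three observed launches (0 = success, 1 = failure), all successes
    pm.Bernoulli("launches", p=failure_probability, observed=[0, 0, 0])
    pm.Bernoulli("nextLaunch", p=failure_probability)
    trace = pm.sample()  # posterior samples of failureProbability and nextLaunch
```

E[f(Z)|X] is then approximated by averaging f over the posterior samples in trace; for example, the fraction of sampled nextLaunch values equal to 1 approximates P(X_4 = 1 | X).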
Poll: Which function f(z) should I use to cast P(X_4 = 1 | X) as E[f(Z)|X]?
- f(z)=z
- f(z)=1[z=1]
- f(z)=1
- f(z) = 1[x_4 = 1]
- None of the above
Conditioning using PPLs: theory
The PPLs we will use in this course are based on Monte Carlo methods.
The theoretical foundation of Monte Carlo methods is the law of large numbers (LLN), by which we mean any theorem providing a result of the form
(1/K) ∑_{k=1}^{K} f(Z_k) → ∫ f(z) π(z) dz
where:
- K is the number of iterations of the Monte Carlo method
- → is a suitable notion of convergence (e.g. almost sure)
- we make some assumptions on Z_1, Z_2, …
- in undergraduate courses: independent and identically distributed
- for Bayesian inference: we typically use a more advanced result, the law of large numbers for Markov chains (later in the course)
- in most Bayesian contexts:
- think of z as the latent (unobserved) states
- π(z) is the posterior (in the LLN for Markov chains, we design the Markov chain so that its stationary distribution is the posterior distribution)
- we assume f ∈ L^1(π), i.e. ∫ |f(z)| π(z) dz < ∞
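A quick numerical check of the LLN in the i.i.d. (undergraduate) setting, using the rocket posterior (a sketch; it relies on the Beta(1, 4) posterior computed earlier, so the draws Z_k = (p_k, x_{4,k}) are exact i.i.d. samples from π):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 100_000

# i.i.d. draws from the posterior π of the rocket model:
# p | x ~ Beta(1, 4), then x4 | p ~ Bernoulli(p).
p = rng.beta(1, 4, K)
x4 = rng.binomial(1, p)

# LLN: the Monte Carlo average of f(Z_k) with f(z) = 1[x4 = 1]
# converges to P(X4 = 1 | X) = 1/5.
print(np.mean(x4 == 1))  # close to 0.2
```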
Conditioning using PPLs: practical example
During the lecture, we went over steps 1 and 2 in this tutorial, walking you through the steps of approximating a posterior distribution using the Blang PPL (focusing on general principles which apply to most other mainstream PPLs, e.g. Stan, JAGS, pyMC).
- Connecting theory (last slide, LLN) and the practical example linked above…
- In step 1 of the tutorial, for pedagogical purposes, we take π to be the joint distribution
- In step 2 of the tutorial, we take π to be the posterior distribution
Step 3: take the Bayes action
argmin{E[L(a,Z)|X]:a∈A}
Solve an optimization problem to turn the posterior distribution into an "action"
The loss function in our example should encode the following information:
- Cost of satellite: $50M
- Cost of the insurance policy: $1M
Exercise: Design a loss function L that encodes the above information
Use the notation:
- Z = 1 means failure, Z = 0 means success
- set of possible actions: A={0,1},
- a=0: do not buy insurance,
- a=1: buy it
Example of loss function
Losses ($M):

| | Buy insurance (a=1) | Don't buy (a=0) |
| --- | --- | --- |
| Success (z=0) | 1 | 0 |
| Failure (z=1) | 1 | 50 |
Loss function: L(a, z) = 1[a=1] ($1M) + 1[a=0] 1[x_4=1] ($50M)
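In code, a minimal sketch (units are $M; the function name is mine):

```python
def loss(a: int, z: int) -> float:
    """L(a, z) in $M: a = 1 means buy the $1M policy; z = 1 means the launch fails."""
    return 1.0 * (a == 1) + 50.0 * (a == 0) * (z == 1)
```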
Next exercise: simplify the optimization problem in the Bayes estimator
argmin{E[L(a,Z)|X]:a∈A}
Example of a Bayes estimator
argmin{E[L(a,Z)|X] : a ∈ A}
= argmin{E[1[a=1] + 50 · 1[a=0] 1[X_4=1] | X] : a ∈ A}
= argmin{E[1[a=1] | X] + 50 E[1[a=0] 1[X_4=1] | X] : a ∈ A}
= argmin{1[a=1] + 50 · 1[a=0] P[X_4=1 | X] : a ∈ A}
= 1[p_fail(X) > 1/50],
where p_fail(X) = P[X_4=1 | X] encodes our predicted probability that the next launch will be a failure. (The last step compares the two actions directly: a = 1 incurs loss 1, while a = 0 incurs expected loss 50 p_fail(X), so buying is optimal exactly when p_fail(X) > 1/50.)
From earlier: “Assume for now that we can ‘magically’ approximate E[f(Z)|X] for any function f(z), based on the output of a PPL.”
So indeed, if we pick f(z) = 1[x_4=1], then P[X_4=1 | X] can be written as E[f(Z)|X], and hence approximated based on the output of a PPL.
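Putting the pieces together (a sketch; the value 0.2 is the posterior predictive probability P[X_4=1 | X] = 1/5 computed earlier, whether obtained in closed form or from a PPL):

```python
def bayes_action(p_fail: float) -> int:
    """Bayes estimator for the insurance problem: buy (a = 1) iff p_fail > 1/50."""
    return int(p_fail > 1 / 50)

print(bayes_action(0.2))  # 1: since 1/5 > 1/50, recommend buying the policy
```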
Go over step 3 in the first PPL tutorial.
Problem (based on a recent PhD defence in which I was the external examiner)
- Let δ_B denote the Bayes estimator, and δ_o, another estimator, based on a deep neural network.
- To compare the two estimators, the student generated a large number of datasets (Z_i, X_i) from the model
- For each dataset, compute a_i = δ_B(X_i), compute the loss L(a_i, Z_i), and score the Bayesian method with the average loss
- Do the same with the other estimator
- The student reported a lower loss for the other estimator, δ_o
Poll: Which of these statements is true?
- The deep neural network can perform better because it involves more parameters (over-parameterization)
- The deep neural network can perform better because of a combination of lucky initialization and noisy gradients
- The deep neural network can perform better because of some other reason
- This cannot happen unless the loss function is non-convex
- This cannot happen because of some other reason
Optimality of Bayesian estimators
- Correct answer is: “This cannot happen because of some other reason”
- We can state this because of another optimality property enjoyed by Bayes estimators
- Let us start by stating this optimality property
- How to encode mathematically the notion of average loss in the description below?
- Let δ_B denote the Bayes estimator, and δ_o, another estimator, based on a deep neural network.
- To compare the two estimators, the student generated a large number of datasets (Z_i, X_i) from the model
- For each dataset, compute a_i = δ_B(X_i), compute the loss L(a_i, Z_i), and score the Bayesian method with the average loss
Optimality of Bayesian estimators: theorem
Claim: E[L(δ_B(X), Z)] ≤ E[L(δ_o(X), Z)]
Proof: by the law of total expectation, E[L(δ_B(X), Z)] = E[ E[L(δ_B(X), Z) | X] ].
Now denote the objective being minimized in the Bayes estimator by O(a) = E[L(a, Z) | X], so that we can rewrite the above as:
E[ E[L(δ_B(X), Z) | X] ] = E[ O(argmin_a O(a)) ]
Now note that for all a′, O(argmin_a O(a)) ≤ O(a′). Applying this with a′ = δ_o(X) and taking expectations:
E[L(δ_B(X), Z)] = E[ O(argmin_a O(a)) ] ≤ E[ O(δ_o(X)) ] = E[L(δ_o(X), Z)].
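To see the theorem in action, here is a sketch (not from the slides) of the student's experimental protocol applied to the rocket model, with a hypothetical plug-in rule standing in for δ_o (the slides' δ_o was a neural network): the plug-in rule buys insurance iff the empirical failure rate exceeds 1/50.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Generate datasets (Z_i, X_i) from the model: p ~ Uniform(0,1), launches ~ Bernoulli(p)
p = rng.uniform(0, 1, N)
X = rng.binomial(1, p[:, None], (N, 3))
Z = rng.binomial(1, p)
k = X.sum(axis=1)  # number of failures observed in each dataset

def loss(a, z):  # in $M, as defined earlier
    return 1.0 * (a == 1) + 50.0 * (a == 0) * (z == 1)

# Bayes estimator: with a uniform prior, P(X4 = 1 | X) = (k + 1) / 5,
# so buy insurance iff (k + 1) / 5 > 1 / 50 (here, always true).
bayes_a = ((k + 1) / 5 > 1 / 50).astype(int)

# Hypothetical plug-in alternative: buy iff the empirical failure rate k/3 > 1/50.
other_a = (k / 3 > 1 / 50).astype(int)

print(loss(bayes_a, Z).mean())  # approx 1.0
print(loss(other_a, Z).mean())  # approx 3.25 -- larger, as the theorem guarantees
```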
Discussion: Why do you think the student observed a lower expected loss for δ_o in their experiment?