STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
2/25/2019
Organization
This week: a Bayesian “bootcamp”. People taking this course typically have a wide range of backgrounds. This will make sure we all have a common foundation.
Rest of the course: selected topics from…
- “Practicalities” of Bayesian analysis (PPLs, Bayes estimators, model/implementation/inference checks)
- Model “building blocks”: common priors, likelihoods, hierarchies, etc (including perhaps a bit of non-parametrics? depending on overlap with Trevor’s course)
- Misc topics that are both useful and of active interest in the research community
- Model choice and averaging
- Computational techniques (including perhaps a few advanced MCMC topics, e.g. advanced tempering, PDMPs, etc)
- Approximate Bayesian Computation
- Pseudo-marginal methods
- Model cuts via couplings
- Potentially other topics, if time permits
Bayesian analysis: pros and cons
- Address most data analysis issues (missing data, non-standard data types, non-iid, weird loss functions, adding expert knowledge, …)
- Bayesian analysis: address those in a (semi) automated fashion / principled framework (“reductionist”)
- Reductionism can be bad or good (main con of reductionism is computational)
- Frequentist statistics: every problem is a new problem
- Implementation complexity
- Efficient in analyst’s time (thanks to PPLs)
- Harder to scale computationally
- ⟹ shines on small data problems (there are many more of those than the “big data” hype would have you think)
- Statistical properties
- Optimal if the model is well-specified
- Sub-optimal in certain cases when the model is mis-specified
- Thankfully the modelling flexibility makes it easier to build better models
- Important to make model checks
Bayesian analysis: some real examples

- Application topics I am interested in
- Origins of life/cancer/language and characterization of their respective evolutionary processes
- Modelling high-throughput genomics data (single-cell sequencing, CRISPR-CAS9, ultra-deep, expression)
- Other examples
- All the classical tasks: classification, regression, clustering, etc
- Pro: can inject more problem-specific structure
- A/B testing and Bayesian optimization
- Bayesian search
- Pro: incorporate several sources of information
- Determining the fate of the universe from the cosmic microwave background
- Use of complex deterministic models as black-boxes
This week’s example
- Would you rather get strapped to…
- “shiny rocket”: 1 success, 0 failures
- “rugged rocket”: 98 successes, 2 failures

Paradox?
- Maximum likelihood point estimates:
- “shiny rocket”: 100% success rate (1 success, 0 failures)
- “rugged rocket”: 98% success rate (98 successes, 2 failures)
- What is missing?
Uncertainty estimates
- Take-home message:
- Point estimates are often insufficient, and can be very dangerous
- We want some measure of uncertainty
- Bayesian inference provides one way to build uncertainty measures
- Bayesian measures of uncertainty we will describe: credible intervals
- Alternatives exist:
- Confidence intervals, from frequentist statistics
- “End product” looks similar, but very different in interpretation and construction
- Will postpone comparison of the two until we have introduced credible intervals
Uncertainty will not go away

- Just collect more data! Otherwise your inference will not be very useful anyway.
- Just launch more rockets and wait?
- In some cases the data is gone, i.e. we will never be able to collect more after a point (e.g.: phylogenetic tree inference)
Bayesian inference: high-level picture
- Construct a probability model including
- random variables for what we will measure/observe
- random variables for the unknown quantities
- those we are interested in (“parameters”, “predictions”)
- others that just help us formulate the problem (“nuisance”, “random effects”).
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to:
- make predictions (point estimates)
- estimate uncertainty (credible intervals)
- make a decision (more on this later)
Step 3 is unified into one framework: decision theory.
First part: “constructing a probability model”
- What is a model?
- What is a probability model?
- Example: building a probability model for the rocket launch problem.
Scientific models
A simplification of reality amenable to mathematical investigation.
Reality →(Art + Scientific method)→ Model →(Mathematics)→ Prediction
- In this course “mathematics” will be Bayesian analysis/probability theory.
- Bayesian analysis/probability theory assume a model as starting point.
- Creating a first model is a bit of an art; it comes with data analysis experience.
- Then, once we have an initial model, we can improve it by checking its predictions against reality.
Towards a model: biased coins

- Standard coin: when you flip,
- get a heads (H; encoded as 1) with probability 0.5
- get a tails (T; encoded as 0) with probability 0.5
- “Biased” coin: when you flip,
- get a H with probability p∈[0,1]
- get a T with probability 1−p
- Synonym: flips are Bernoulli random variables
- For us:
- 1 ↔ successful launch
- 0 ↔ failed launch
- Problem: we do not know p!
Making p random

- Imagine a bag with K+1 biased coins
- Coin number i ∈ {0, 1, 2, …, K} has bias p_i = i/K
- Example: K+1=3 coins
- First coin: p_0 = 0/2 = 0
- Second coin: p_1 = 1/2
- Third coin: p_2 = 2/2 = 1
Simple probability model for the rocket launch problem

- Step 1: Pick one of the K+1 coins from the bucket
- Step 2: Repeatedly flip the same coin
- ↔ Build a bunch of copies of the rocket, launch them and enjoy the show
Warm-up calculation / probability diagnostic
- Model
- Step 1: Pick one of the K+1=3 coins from the bucket
- Step 2: Repeatedly flip the same coin
- Diagnostic (part 1):
- If the first n=3 flips all return heads, what is the posterior probability that the standard (p=1/2) coin was picked?

Diagnostic (part 2): given you observe the same data (3 Heads in a row), what is your belief that the next draw is again a Heads?
Exercise 1, question 1: write a function taking as input:
- Prior distribution (stored as an array) over the coins in the bucket (e.g., here uniform)
- Observed flips (encoded as number of trials and number of successes)
and returning as output an array of posterior probabilities.
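A minimal sketch of one possible solution in R (the names coin_posterior, nTrials, nSuccesses are my choices, not prescribed by the exercise); it uses the normalization trick derived in the upcoming slides:

coin_posterior <- function(prior, nTrials, nSuccesses) {
  K <- length(prior) - 1                       # coins indexed 0..K; coin i has bias i/K
  p <- (0:K) / K
  gamma <- prior * p^nSuccesses * (1 - p)^(nTrials - nSuccesses)  # prior times likelihood
  gamma / sum(gamma)                           # normalize to get the posterior
}
coin_posterior(rep(1/3, 3), 3, 3)              # diagnostic above: returns (0, 1/9, 8/9)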
Recap: some probability vocabulary

- Decision tree: a recursive classification of all possible scenarios
- Nodes in the tree are “groups of scenarios” which we call events
- Children of a node break an event into an exhaustive list of subcases, this is called a partition
- For example, C0, C1, C2 is a partition of S since Ci ∩ Cj = ∅ for i ≠ j and ∪i Ci = S.
- Here we did this until there is only one scenario in the events at the leaves
- We call one individual scenario an outcome
- We call the set of all outcomes the sample space, S, and put it at the root of decision trees.
- A probability is a function P that satisfies the following constraints:
- P should take events as input and return a number between zero and one (P(E)∈[0,1])
- If E1,E2,… is a partition of E, then P(E)=∑iP(Ei) (“additivity”)
- P(S)=1
Recap: axioms/constraints of probability

- A probability is a function P that satisfies the following constraints:
- P should take events as input and return a number between zero and one (P(E)∈[0,1])
- If E1,E2,… is a partition of E, then P(E)=∑iP(Ei)
- P(S)=1
- Thanks to the constraints, even if I only specify a few known probabilities I can recover many other ones mathematically/computationally.
- Next: recap of the main tools to carry out these computations.
- Conditioning
- Chain rule
- Law of total probability
Recap: conditioning
- The workhorse of Bayesian inference!
- Used to define models (what we will do now)
- And soon, to get point estimates and credible intervals

- Important: we want the result of this procedure, P(⋅|E) to be a probability when viewed as a function of the first argument for any fixed E.
Recap: definition of conditional probability P(A|E)
- Another (higher level) picture:


- For a query event A, what should be the updated probability?
- We want to remove from A all the outcomes that are not compatible with the new information E. How?
- Take the intersection (∩): A∩E
- We also want: P(S|E)=1 Why?
- How? Renormalize: P(A|E) = P(A∩E) / P(E)
- Intersection can also be denoted using a comma, for example P(A∩E)=P(A,E)
Labelling decision trees with (conditional) probabilities

- If there is an edge pointing to an event E, annotate it with P(E | intersection of all events ancestral to E).
- Note: since at each step we take a partition, (intersection of all events ancestral to E)=(parent of E)
- Chain rule: the probability of a leaf is the product of the annotated probabilities on the path to the root.
- Mathematically:
- P(E1∩E2)=P(E1)P(E2|E1) (direct from definition)
- True for any order, i.e. P(E1∩E2)=P(E2)P(E1|E2)
- More generally:
- P(E1∩E2∩E3∩…)=P(E1)P(E2|E1)P(E3|E1∩E2)…
- Example: back to our calculation, γ(0) = 0, while
γ(1) = P(C1, H1, H2, H3) = P(C1) P(H1|C1) P(H2|C1, H1) P(H3|C1, H1, H2) = P(C1) P(H1|C1) P(H2|C1) P(H3|C1) = (1/3)(1/2)(1/2)(1/2) = 1/24
γ(2) = P(C2, H1, H2, H3) = P(C2) P(H1|C2) P(H2|C2) P(H3|C2) = (1/3)(1)(1)(1) = 1/3
Last step in the computation
- Attack the slightly more general problem π(i)=P(Ci|H1,H2,H3)
- By definition of conditioning, π(i) = P(Ci, H1, H2, H3) / P(H1, H2, H3); let us call the numerator γ(i) and the denominator Z, i.e. π(i) = γ(i)/Z
- Start by computing γ(0),γ(1),γ(2) (using chain rule, see previous slide)
- Note Z=γ(0)+γ(1)+γ(2) (why?).
- Hence we can compute π by normalizing the vector γ so that it sums to one.
Mathematical expression for the above algorithm (Bayes’ rule): if i∈{0,1,…,K} indexes a partition Ci of S, then the posterior distribution, π(i)=P(Ci|D), is proportional to the joint distribution γ(i)=P(Ci,D) in the sense that
π(i) = γ(i) / ∑_{j=0}^{K} γ(j). Answer: P(C1|H1, H2, H3) = (1/24) / (0 + 1/24 + 1/3) = 1/9.
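A quick numeric check of this answer in R, using the γ values from the chain rule slide:

gamma <- c(0, 1/24, 1/3)   # gamma(0), gamma(1), gamma(2) computed above
gamma / sum(gamma)         # posterior: (0, 1/9, 8/9), so P(C1|H1,H2,H3) = 1/9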
From posterior distributions to point estimates
- For large K, you get a large vector (π(0),π(1),…,π(K))
- How to visualize π(i)?
- How to provide a numerical summary?
- Tool we will use: random variables
Recap: random variables
- Motivation: a better notation for creating partitions
- Idea: a random variable is a labelling function, which assigns a label to each outcome
- Example: the number of times you see heads N
- The bias index, I
- How to create a partition:

Probabilist’s notation

- Examples:
- Probability Mass Function (PMF): the PMF of random variable X, denoted p or pX if disambiguation is needed, is defined as p(x)=P(X=x)
Example: coins

- PMF of fair coin: p(0)=p(1)=1/2
- Generalization: the family of Bernoulli PMFs
- Recall: when you flip a “Biased” coin
- get a 1 with probability p∈[0,1]
- get a 0 with probability 1−p
- p is an example of what is called a parameter
- Notation: X∼Bern(p)
- PMF: p(x) = p^x (1−p)^(1−x) for x ∈ {0, 1}
Better notation
Using our slick probabilist’s notation introduced at the end of last lecture, our model is:
I ∼ Unif{0, 1, 2, …, K−1, K}
Xm | I ∼ Bern(I/K); m ∈ {1, 2, 3}

Recall:
- Interpretation of I: index of the coin I pick, Xm, the m-th flip
- The notation Xm | I ∼ Bern(I/K) means…
- The conditional PMF of the random variable Xm given I is Bern(I/K)
- I.e. for any x ∈ R and i ∈ {0, 1, …, K}, the following conditional probability is obtained by looking up the picture on the right
- Or, mathematically
P(Xm = x | I = i) = i/K if x = 1; 1 − i/K if x = 0; 0 otherwise
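The same conditional PMF written as a small R function (the name condPmf is my choice):

condPmf <- function(x, i, K) {   # P(Xm = x | I = i) for the model above
  if (x == 1) i / K else if (x == 0) 1 - i / K else 0
}
condPmf(1, 1, 2)                 # P(Xm = 1 | I = 1) = 1/2 when K = 2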
Point estimate (summaries) for PMFs and posterior PMF
- Posterior PMF of a random variable X given observation D defined as you would expect, π(x)=P(X=x|D)
- Usual suspects for providing point estimate summaries of a PMF p(x)
- Mode: a global optimum of the PMF, argmax_x p(x)
- In the context of posterior distribution, called the maximum a posteriori (MAP)
- Mean: ∑_x x p(x)
- In the context of posterior distribution, called the posterior mean
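A small R sketch of both summaries for a PMF stored as a vector over the support {0, …, K} (the name pmf_summaries is my choice):

pmf_summaries <- function(pmf) {
  support <- 0:(length(pmf) - 1)
  list(mode = support[which.max(pmf)],   # MAP when pmf is a posterior
       mean = sum(support * pmf))        # posterior mean when pmf is a posterior
}
pmf_summaries(c(0, 1/9, 8/9))            # diagnostic posterior: mode = 2, mean = 17/9

Note: on the bias scale, the posterior mean is sum(((0:2)/2) * c(0, 1/9, 8/9)) = 17/18, which will reappear as the predictive probability below.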
More on decision trees: drawing your own

Recall we computed each γ(i) by multiplying conditional probabilities on the edges of a path in a decision tree. I gave you the decision tree; how do you draw your own decision tree from scratch?
Idea:
- write down the model as we did before:
I ∼ Unif{0, 1, 2, …, K−1, K}
Xm | I ∼ Bern(I/K); m ∈ {1, 2, 3}
- Each chunk of the form Y|Z tells you that you should make the decisions for random variable Y after those for random variable Z.
Rationale: mathematically, chain rule works in any order (P(A,B)=P(A)P(B|A)=P(B)P(A|B)), but here we want to pick an order such that all the conditional distributions are known (part of the model specification)
Example: for the coin bucket model, this tells us we should draw I first, then the Xm’s (and that the order among the X’s does not matter).
Large decision trees: graphical models

Question: how to do this for a large model? For example:
X1 ∼ Unif{0, 1/2, 1}
X2 | X1 ∼ Bern(X1)
X3 | X1, X4 ∼ Unif{1, X1, X4}
X4 ∼ Bern(1/3)
X5 | X3 ∼ Unif{0, …, X3}
X7 ∼ Bern(1/4)
X6 | X4, X7 ∼ Unif{X4, X7}
X8 ∼ Bern(1/4)
Idea:
- Build a graph, where nodes are random variables and edges are dependencies coming from the given conditional distributions. If you see X|Y, add an edge Y→X.
- Use an ordering of the nodes that respects the directionality of all the edges.
Terminology: this graph is called a (directed) graphical model or Bayes net
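A hedged sketch of this idea in R for the example model above (the graph representation and names are mine, and the loop assumes the graph is acyclic): read the edges off the “Y|Z” chunks, then repeatedly pick nodes whose parents have all been placed.

# Edges read off the example model: each Y|Z chunk contributes Z -> Y
edges <- list(c("X1","X2"), c("X1","X3"), c("X4","X3"),
              c("X3","X5"), c("X4","X6"), c("X7","X6"))
nodes <- paste0("X", 1:8)
parentsOf <- function(v) vapply(Filter(function(e) e[2] == v, edges), `[`, "", 1)
sampOrder <- character(0)
while (length(sampOrder) < length(nodes)) {  # assumes no directed cycles
  ready <- setdiff(nodes[vapply(nodes, function(v) all(parentsOf(v) %in% sampOrder), TRUE)],
                   sampOrder)
  sampOrder <- c(sampOrder, ready)
}
sampOrder  # one valid sampling order: "X1" "X4" "X7" "X8" "X2" "X3" "X6" "X5"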
Generating synthetic datasets
- Graphical models also provide a way to generate synthetic datasets.
- Simply follow the same ordering as discussed in the previous slide,
- but this time sample a realization from each (conditional) distribution (see the sketch after this list).
- Much cheaper than decision trees! Linear complexity instead of exponential.
- Useful for:
- benchmarking, as the synthetic dataset created this way contains both the observations and the “unknowns”
- defining important concepts such as calibration and optimality (soon)
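A minimal sketch for the coin-bucket model, following that ordering (the function name generate_synthetic is my choice):

generate_synthetic <- function(K, n) {
  i <- sample(0:K, 1)              # I ~ Unif{0, ..., K}
  x <- rbinom(n, 1, i / K)         # X_m | I ~ Bern(I/K), m = 1..n
  list(trueIndex = i, flips = x)   # keep the "unknown" I for benchmarking
}
set.seed(1)
generate_synthetic(K = 2, n = 3)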
Measures of uncertainty
- You are given a level, typically 95% or 90%, denoted 1−α=0.95 or 1−α=0.9
- Ideally, we would want to find a set T such that ∑x∈Tp(x)=0.9
- For discrete distributions, there may not be a solution, but let us ignore that issue for now (suppose each probability is very small and replace = by ≈ up to an error bounded by the largest individual probability)
- For continuous distributions that will not be an issue
- There is still a problem: there are many solutions! Some of them might appear weird, e.g. include points of probability zero!
- Example: both sets shown in red contain 75% of the mass

High Density Intervals (HDI)
- High Density “Intervals”: pick a solution with the least number of points |T|
- To handle discrete models: argmin{|T|:∑x∈Tp(x)≥0.9}.
- Still not unique in some corner cases, but good enough
- The “interval” in the name is unfortunate, as the set is not actually always an interval
- Example: multimodal distribution
- Sometimes called high density region
Example: the left one below is the HDI because the number of points, |T|, is smaller (5 < 9)

Exercise 1, question 2: implement a function taking as input a posterior PMF (as a vector), a level, and returns an HDI.
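One possible greedy approach, sketched in R (take the highest-probability points until the level is covered; the name hdi and the 0-indexed support are my choices):

hdi <- function(pmf, level = 0.9) {
  ord <- order(pmf, decreasing = TRUE)        # highest-density points first
  nNeeded <- which(cumsum(pmf[ord]) >= level)[1]
  sort(ord[seq_len(nNeeded)] - 1)             # return support points, 0-indexed
}
hdi(c(0, 1/9, 8/9), 0.9)  # returns 1 2: the point with mass 8/9 alone covers only ~0.889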
Prediction

Diagnostic (part 2): given you observe the same data (3 Heads in a row), what is your belief that the next draw is again a Heads?
I ∼ Unif{0, 1, 2, …, K−1, K}
Xm | I ∼ Bern(I/K); m ∈ {1, 2, 3}
Prediction: using posterior probability as the updated belief

Good approach:
- Add a random variable for the quantity you want to predict (next flip N).
- Add one level in the decision tree.
- Use Bayes rule.
- Twist: distinct paths lead to the same prediction
- Sum the probabilities of the paths you collapse (why can we do this?).
- Two ways to get (H1:3, N = H again): γ(1) = 1/48 + 1/3 = 17/48
- Only one way to get (H1:3, N = T): γ(0) = 1/48
- As before, normalize γ to get π: P(N = H again | H1:3) = γ(1) / (γ(0) + γ(1)) = (17/48) / (18/48) = 17/18
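A quick numeric check via the law of total probability (next slide): average the coin biases under the posterior.

p <- c(0, 1, 2) / 2            # coin biases
postCoin <- c(0, 1/9, 8/9)     # posterior over coins after three heads
sum(p * postCoin)              # P(next flip = H | HHH) = 17/18 = 0.9444...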
Prediction example: all cases
Let now π denote a vector where the first component is P(N=Tails|H1:3) and the second, P(N=Heads|H1:3).
- If you see “Heads, Heads, Heads” updated beliefs: π=(1/18,17/18)
- Interpretation: “I predict another Heads with 94% confidence”
- Exercise: if you see “Heads, Tails, Tails” or any other mix of the two: π=(1/2,1/2)
- Interpretation: “I do not lean toward either prediction” (if forced, flip your own coin!)
- By symmetry: if you see “Tails, Tails, Tails”: π=(17/18,1/18)
- Interpretation: “I predict another Tails with 94% confidence”
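A hedged check of the exercise case (one H and two T): only the fair coin has positive posterior mass, so the prediction is 1/2.

p <- c(0, 1, 2) / 2
gamma <- (1/3) * p * (1 - p)^2   # prior times likelihood of the sequence H, T, T
postCoin <- gamma / sum(gamma)   # (0, 1, 0): only the fair coin survives
sum(p * postCoin)                # P(next = H) = 1/2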
Recap: law of total probability
The process of adding the probabilities of the paths is called the law of total probability, mathematically, if Ei is a partition of the sample space, then for any event A, P(A)=∑iP(A,Ei)
Why it matters: this provides a principled way to handle many data analysis issues such as missing data, prediction, censoring, etc. More examples of that later.
Informally: “marginalize” over all the uncertain quantities. In particular, in the Bayesian framework, during prediction, the parameters can be better seen as “nuisance” parameters.
Prediction: “bad approach”
In the prediction exercise, some of you may have used the following argument:
- Compute a point estimate on p.
- Then look at the probability of H and T for a Bernoulli with the point estimate plugged-in
Warning: in general this is a suboptimal way to update beliefs.
But: in this specific example, if you take the point estimate to be the posterior mean, you will find the same answer. This is not true in the vast majority of cases! Here it works because the Bernoulli has a single parameter, its mean. Try it for other models next week; you will see the plug-in method gives a different, sub-optimal answer.
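Even in this example, a different point estimate already breaks the coincidence; a small hedged check, with values from the diagnostic above:

p <- c(0, 1, 2) / 2
postCoin <- c(0, 1/9, 8/9)
p[which.max(postCoin)]   # MAP plug-in: predicts H with probability 1
sum(p * postCoin)        # full posterior predictive: 17/18, strictly less than 1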
Calibrated prediction uncertainty
Question: In what sense is a belief update “optimal”?
Consider the following experiment based on repeated synthetic data generation followed by prediction evaluation
library(dplyr)   # needed below for %>% and mutate
library(ggplot2) # needed below for the plot

set.seed(1)
rbern <- function(n, prob) { return(rbinom(n, 1, prob)) }
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max
# We will keep track for each synthetic dataset of
# whether we successfully predicted (we know truth because we are using synthetic)
success <- c()
nIters <- 10000
for (iteration in seq_len(nIters)) {
# Data generation
i <- rdunif(3) - 1 # Uniform on {0, 1, 2}
x <- rbern(3, i/2.0)
nextThrow <- rbern(1, i/2.0)
nHeads <- sum(x)
# Record prediction statistics for the case of interest
if (nHeads == 3) {
pred <- 1 # predict H, since P(next = H | 3 heads) = 17/18 > 1/2
success <- c( success, (if (pred == nextThrow) 1 else 0))
}
}
# Plot evolution of the statistics across iterations
df <- data.frame("iteration" = c(1:length(success)), success)
df <- df %>%
mutate(
successRate = cumsum(success) / seq_along(success)
)
ggplot(data=df, aes(x=iteration, y=successRate)) +
geom_hline(yintercept=17/18, linetype="dashed", color = "black") +
geom_line()

Calibration
- This property, “being right 94% of the time when you say you are 94% sure”, is called calibration
- For Bayesian methods, calibration is always satisfied if the model is true
- Things are more complicated if the model is not true (“M-open”), more on that in the exercise.
Exercise 1, question 3:
- Back to the problem of estimation of p: let us first consider a uniform prior with K = 1000, say. Generate 1000 datasets, each of size 10, from the model and use your posterior and HDI functions at the 90% level to perform Bayesian inference. Is the HDI calibrated? What does that mean, exactly?
- Now generate the data from a different prior, say a discretized normal distribution with mean 0.2 and standard deviation 0.2. How does this affect calibration? Do the same for a case where each dataset has size 100 instead, then where each has size 1000.
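One hedged way to build the discretized normal prior (assumption: evaluate the normal density on the grid of biases and renormalize; other discretizations are possible):

K <- 1000
grid <- (0:K) / K                          # the grid of biases i/K
prior <- dnorm(grid, mean = 0.2, sd = 0.2) # normal density evaluated on the grid
prior <- prior / sum(prior)                # renormalize into a PMF over the K + 1 coins
# a coin index can then be sampled with sample(0:K, 1, prob = prior)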
Exercise 1, optional question: apply your posterior inference method to real data. For example, the rocket launch data.
Exercise 1, optional question: a randomized algorithm can be viewed as a function that takes as input a special object which provides access to randomness. Let us assume this object only provides one function, called “bernoulli(p)”, to access iid Bernoullis. Write a function that takes an arbitrary randomized algorithm as input and prints out its decision tree.