STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
2/25/2019
Organization
This week: a Bayesian “bootcamp”. People taking this course typically have a wide range of backgrounds. This will make sure we all have a common foundation.
Rest of the course: selected topics from…
- “Practicalities” of Bayesian analysis (PPLs, Bayes estimators, model/implementation/inference checks)
- Model “building blocks”: common priors, likelihoods, hierarchies, etc (including perhaps a bit of non-parametrics? depending on overlap with Trevor’s course)
- Misc topics that are both useful and of active interest in the research community
- Model choice and averaging
- Computational techniques (including perhaps a few advanced MCMC topics, e.g. advanced tempering, PDMPs, etc)
- Approximate Bayesian Computation
- Pseudo-marginal methods
- Model cuts via couplings
- Potentially other topics, if time permits
Bayesian analysis: pros and cons
- Address most data analysis issues (missing data, non-standard data types, non-iid, weird loss functions, adding expert knowledge, …)
- Bayesian analysis: address those in a (semi) automated fashion / principled framework (“reductionist”)
- Reductionism can be bad or good (main con of reductionism is computational)
- Frequentist statistics: every problem is a new problem
- Implementation complexity
- Efficient in analyst’s time (thanks to PPLs)
- Harder to scale computationally
- ⟹ shines on small data problems (there are many more of those than the “big data” hype would have you think)
- Statistical properties
- Optimal if the model is well-specified
- Sub-optimal in certain cases when the model is mis-specified
- Thankfully the modelling flexibility makes it easier to build better models
- Important to make model checks
Bayesian analysis: some real examples

- Application topics I am interested in
- Origins of life/cancer/language and characterization of their respective evolutionary processes
- Modelling high-throughput genomics data (single-cell sequencing, CRISPR-CAS9, ultra-deep, expression)
- Other examples
- All the classical tasks: classification, regression, clustering, etc
- Pro: can inject more problem-specific structure
- A/B testing and Bayesian optimization
- Bayesian search
- Pro: incorporate several sources of information
- Determining the fate of the universe from the cosmic microwave background
- Use of complex deterministic models as black-boxes
This week’s example
- Would you rather get strapped to…
- “shiny rocket”: 1 success, 0 failures
- “rugged rocket”: 98 successes, 2 failures

Paradox?
- Maximum likelihood point estimates:
- “shiny rocket”: 100% success rate (1 success, 0 failures)
- “rugged rocket”: 98% success rate (98 successes, 2 failures)
- What is missing?
Uncertainty estimates
- Take-home message:
- Point estimates are often insufficient, and can be very dangerous
- We want some measure of uncertainty
- Bayesian inference provides one way to build uncertainty measures
- Bayesian measures of uncertainty we will describe: credible intervals
- Alternatives exist:
- Confidence intervals, from frequentist statistics
- “End product” looks similar, but very different in interpretation and construction
- Will postpone comparison of the two until we have introduced credible intervals
Uncertainty will not go away

- Just collect more data! Otherwise your inference will not be very useful anyway.
- Just launch more rockets and wait?
- In some cases the data is gone, i.e. we will never be able to collect more after a point (e.g.: phylogenetic tree inference)
Bayesian inference: high-level picture
- Construct a probability model including
- random variables for what we will measure/observe
- random variables for the unknown quantities
- those we are interested in (“parameters”, “predictions”)
- others that just help us formulate the problem (“nuisance”, “random effects”).
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to:
- make predictions (point estimates)
- estimate uncertainty (credible intervals)
- make a decision (more on this later)
Step 3 is unified into one framework: decision theory.
First part: “constructing a probability model”
- What is a model?
- What is a probability model?
- Example: building a probability model for the rocket launch problem.
Scientific models
A simplification of reality amenable to mathematical investigation.
Reality →(Art + Scientific method)→ Model →(Mathematics)→ Prediction
- In this course “mathematics” will be Bayesian analysis/probability theory.
- Bayesian analysis/probability theory assume a model as starting point.
- Creating a first model is a bit of an art; it comes with data analysis experience.
- Then, once we have an initial model, we can improve it by checking its predictions against reality.
Towards a model: biased coins

- Standard coin: when you flip,
- get a heads (H; encoded as 1) with probability 0.5
- get a tails (T; encoded as 0) with probability 0.5
- “Biased” coin: when you flip,
- get a H with probability p∈[0,1]
- get a T with probability 1−p
- Synonym: flips are Bernoulli random variables
- For us:
- 1 ↔ successful launch
- 0 ↔ failed launch
- Problem: we do not know p!
Making p random

- Imagine a bag with K+1 biased coins
- Coin number i ∈ {0, 1, 2, …, K} has bias p_i = i/K
- Example: K+1=3 coins
- First coin: p_0 = 0/2 = 0
- Second coin: p_1 = 1/2
- Third coin: p_2 = 2/2 = 1
Simple probability model for the rocket launch problem

- Step 1: Pick one of the K+1 coins from the bucket
- Step 2: Repeatedly flip the same coin
- ↔ Build a bunch of copies of the rocket, launch them and enjoy the show
Warm-up calculation / probability diagnostic
- Model
- Step 1: Pick one of the K+1=3 coins from the bucket
- Step 2: Repeatedly flip the same coin
- Diagnostic (part 1):
- If the first n=3 flips all return heads, what is the posterior probability that the standard (p=1/2) coin was picked?

Diagnostic (part 2): given you observe the same data (3 Heads in a row), what is your belief that the next draw is again a Heads?
Exercise 1, question 1: write a function taking as input:
- Prior distribution (stored as an array) over the coins in the bucket (e.g., here uniform)
- Observed flips (encoded as number of trials and number of successes)
and returning as output an array of posterior probabilities.
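A minimal sketch of one possible solution in R (the names coin_posterior, nTrials, nSuccesses are my choices, not prescribed by the exercise); it uses the normalization trick derived in the upcoming slides:

coin_posterior <- function(prior, nTrials, nSuccesses) {
  K <- length(prior) - 1                       # coins indexed 0..K; coin i has bias i/K
  p <- (0:K) / K
  gamma <- prior * p^nSuccesses * (1 - p)^(nTrials - nSuccesses)  # prior times likelihood
  gamma / sum(gamma)                           # normalize to get the posterior
}
coin_posterior(rep(1/3, 3), 3, 3)              # diagnostic above: returns (0, 1/9, 8/9)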
Recap: some probability vocabulary

- Decision tree: a recursive classification of all possible scenarios
- Nodes in the tree are “groups of scenarios” which we call events
- Children of a node break an event into an exhaustive list of subcases, this is called a partition
- For example, C0, C1, C2 is a partition of S since Ci ∩ Cj = ∅ for i ≠ j and ∪i Ci = S.
- Here we did this until there is only one scenario in the events at the leaves
- We call one individual scenario an outcome
- We call the set of all outcomes the sample space, S, and put it at the root of decision trees.
- A probability is a function P that satisfies the following constraints:
- P should take events as input and return a number between zero and one (P(E)∈[0,1])
- If E1,E2,… is a partition of E, then P(E)=∑iP(Ei) (“additivity”)
- P(S)=1
Recap: axioms/constraints of probability

- A probability is a function P that satisfies the following constraints:
- P should take events as input and return a number between zero and one (P(E)∈[0,1])
- If E1,E2,… is a partition of E, then P(E)=∑iP(Ei)
- P(S)=1
- Thanks to the constraints, even if I only specify a few known probabilities I can recover many other ones mathematically/computationally.
- Next: recap of the main tools to carry out these computations.
- Conditioning
- Chain rule
- Law of total probability
Recap: conditioning
- The workhorse of Bayesian inference!
- Used to define models (what we will do now)
- And soon, to get point estimates and credible intervals

- Important: we want the result of this procedure, P(⋅|E) to be a probability when viewed as a function of the first argument for any fixed E.
Recap: definition of conditional probability P(A|E)
- Another (higher level) picture:


- For a query event A, what should be the updated probability?
- We want to remove from A all the outcomes that are not compatible with the new information E. How?
- Take the intersection (∩): A∩E
- We also want: P(S|E)=1 Why?
- How? Renormalize: P(A|E) = P(A∩E) / P(E)
- Intersection can also be denoted using a comma, for example P(A∩E)=P(A,E)
Labelling decision trees with (conditional) probabilities

- If there is an edge pointing to an event E, annotate it with P(E | intersection of all events ancestral to E).
- Note: since at each step we take a partition, (intersection of all events ancestral to E)=(parent of E)
- Chain rule: the probability of a leaf is the product of the annotated probabilities on the path to the root.
- Mathematically:
- P(E1∩E2)=P(E1)P(E2|E1) (direct from definition)
- True for any order, i.e. P(E1∩E2)=P(E2)P(E1|E2)
- More generally:
- P(E1∩E2∩E3∩…)=P(E1)P(E2|E1)P(E3|E1∩E2)…
- Example: back to our calculation, γ(0) = 0, while
γ(1) = P(C1, H1, H2, H3) = P(C1) P(H1|C1) P(H2|C1, H1) P(H3|C1, H1, H2) = P(C1) P(H1|C1) P(H2|C1) P(H3|C1) = (1/3)(1/2)(1/2)(1/2) = 1/24
γ(2) = P(C2, H1, H2, H3) = P(C2) P(H1|C2) P(H2|C2) P(H3|C2) = (1/3)(1)(1)(1) = 1/3
Last step in the computation
- Attack the slightly more general problem π(i)=P(Ci|H1,H2,H3)
- By definition of conditioning, π(i) = P(Ci, H1, H2, H3) / P(H1, H2, H3); let us call the numerator γ(i) and the denominator Z, i.e. π(i) = γ(i)/Z
- Start by computing γ(0),γ(1),γ(2) (using chain rule, see previous slide)
- Note Z=γ(0)+γ(1)+γ(2) (why?).
- Hence we can compute π by normalizing the vector γ so that it sums to one.
Mathematical expression for the above algorithm (Bayes’ rule): if i∈{0,1,…,K} indexes a partition Ci of S, then the posterior distribution, π(i)=P(Ci|D), is proportional to the joint distribution γ(i)=P(Ci,D) in the sense that
π(i) = γ(i) / ∑_{j=0}^{K} γ(j). Answer: P(C1|H1, H2, H3) = (1/24) / (0 + 1/24 + 1/3) = 1/9.
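A quick numeric check of this answer in R, using the γ values from the chain rule slide:

gamma <- c(0, 1/24, 1/3)   # gamma(0), gamma(1), gamma(2) computed above
gamma / sum(gamma)         # posterior: (0, 1/9, 8/9), so P(C1|H1,H2,H3) = 1/9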
From posterior distributions to point estimates
- For large K, you get a large vector (π(0),π(1),…,π(K))
- How to visualize π(i)?
- How to provide a numerical summary?
- Tool we will use: random variables
Recap: random variables
- Motivation: a better notation for creating partitions
- Idea: a random variable is a labelling function, which assigns a label to each outcome
- Example: the number of times you see heads N
- The bias index, I
- How to create a partition:

Probabilist’s notation

- Examples:
- Probability Mass Function (PMF): the PMF of random variable X, denoted p or pX if disambiguation is needed, is defined as p(x)=P(X=x)
Example: coins

- PMF of fair coin: p(0)=p(1)=1/2
- Generalization: the family of Bernoulli PMFs
- Recall: when you flip a “Biased” coin
- get a 1 with probability p∈[0,1]
- get a 0 with probability 1−p
- p is an example of what is called a parameter
- Notation: X∼Bern(p)
- PMF: p(x) = p^x (1−p)^(1−x) for x ∈ {0, 1}
Better notation
Using our slick probabilist’s notation introduced at the end of last lecture, our model is:
I ∼ Unif{0, 1, 2, …, K−1, K}
Xm | I ∼ Bern(I/K); m ∈ {1, 2, 3}

Recall:
- Interpretation of I: index of the coin I pick, Xm, the m-th flip
- The notation Xm | I ∼ Bern(I/K) means…
- The conditional PMF of the random variable Xm given I is Bern(I/K)
- I.e. for any x ∈ R and i ∈ {0, 1, …, K}, the following conditional probability is obtained by looking up the picture on the right
- Or, mathematically
P(Xm = x | I = i) = i/K if x = 1; 1 − i/K if x = 0; 0 otherwise
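The same conditional PMF written as a small R function (the name condPmf is my choice):

condPmf <- function(x, i, K) {   # P(Xm = x | I = i) for the model above
  if (x == 1) i / K else if (x == 0) 1 - i / K else 0
}
condPmf(1, 1, 2)                 # P(Xm = 1 | I = 1) = 1/2 when K = 2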
Point estimate (summaries) for PMFs and posterior PMF
- Posterior PMF of a random variable X given observation D defined as you would expect, π(x)=P(X=x|D)
- Usual suspects for providing point estimate summaries of a PMF p(x)
- Mode: a global optimum of the PMF, argmax_x p(x)
- In the context of posterior distribution, called the maximum a posteriori (MAP)
- Mean: ∑_x x p(x)
- In the context of posterior distribution, called the posterior mean
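A small R sketch of both summaries for a PMF stored as a vector over the support {0, …, K} (the name pmf_summaries is my choice):

pmf_summaries <- function(pmf) {
  support <- 0:(length(pmf) - 1)
  list(mode = support[which.max(pmf)],   # MAP when pmf is a posterior
       mean = sum(support * pmf))        # posterior mean when pmf is a posterior
}
pmf_summaries(c(0, 1/9, 8/9))            # diagnostic posterior: mode = 2, mean = 17/9

Note: on the bias scale, the posterior mean is sum(((0:2)/2) * c(0, 1/9, 8/9)) = 17/18, which will reappear as the predictive probability below.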
More on decision trees: drawing your own

Recall we computed each γ(i) by multiplying conditional probabilities on the edges of a path in a decision tree. I gave you the decision tree; how do you draw your own decision tree from scratch?
Idea:
- write down the model as we did before:
I ∼ Unif{0, 1, 2, …, K−1, K}
Xm | I ∼ Bern(I/K); m ∈ {1, 2, 3}
- Each chunk of the form Y|Z tells you that you should make the decisions for random variable Y after those for random variable Z.
Rationale: mathematically, chain rule works in any order (P(A,B)=P(A)P(B|A)=P(B)P(A|B)), but here we want to pick an order such that all the conditional distributions are known (part of the model specification)
Example: for the coin bucket model, this tells us we should draw I first, then the Xm’s (and that the order among the X’s does not matter).
Large decision trees: graphical models

Question: how to do this for a large model? For example:
X1 ∼ Unif{0, 1/2, 1}
X2 | X1 ∼ Bern(X1)
X3 | X1, X4 ∼ Unif{1, X1, X4}
X4 ∼ Bern(1/3)
X5 | X3 ∼ Unif{0, …, X3}
X7 ∼ Bern(1/4)
X6 | X4, X7 ∼ Unif{X4, X7}
X8 ∼ Bern(1/4)
Idea:
- Build a graph, where nodes are random variables and edges are dependencies coming from the given conditional distributions. If you see X|Y, add an edge Y→X.
- Use an ordering of the nodes that respects the directionality of all the edges.
Terminology: this graph is called a (directed) graphical model or Bayes net
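A hedged sketch of this idea in R for the example model above (the graph representation and names are mine, and the loop assumes the graph is acyclic): read the edges off the “Y|Z” chunks, then repeatedly pick nodes whose parents have all been placed.

# Edges read off the example model: each Y|Z chunk contributes Z -> Y
edges <- list(c("X1","X2"), c("X1","X3"), c("X4","X3"),
              c("X3","X5"), c("X4","X6"), c("X7","X6"))
nodes <- paste0("X", 1:8)
parentsOf <- function(v) vapply(Filter(function(e) e[2] == v, edges), `[`, "", 1)
sampOrder <- character(0)
while (length(sampOrder) < length(nodes)) {  # assumes no directed cycles
  ready <- setdiff(nodes[vapply(nodes, function(v) all(parentsOf(v) %in% sampOrder), TRUE)],
                   sampOrder)
  sampOrder <- c(sampOrder, ready)
}
sampOrder  # one valid sampling order: "X1" "X4" "X7" "X8" "X2" "X3" "X6" "X5"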
Generating synthetic datasets
- Graphical models also provide a way to generate synthetic datasets.
- Simply follow the same ordering as discussed in the previous slide,
- but this time sample a realization from each (conditional) distribution (see the sketch after this list).
- Much cheaper than decision trees! Linear complexity instead of exponential.
- Useful for:
- benchmarking, as the synthetic dataset created this way contains both the observations and the “unknowns”
- defining important concepts such as calibration and optimality (soon)
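A minimal sketch for the coin-bucket model, following that ordering (the function name generate_synthetic is my choice):

generate_synthetic <- function(K, n) {
  i <- sample(0:K, 1)              # I ~ Unif{0, ..., K}
  x <- rbinom(n, 1, i / K)         # X_m | I ~ Bern(I/K), m = 1..n
  list(trueIndex = i, flips = x)   # keep the "unknown" I for benchmarking
}
set.seed(1)
generate_synthetic(K = 2, n = 3)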
Measures of uncertainty
- You are given a level, typically 95% or 90%, denoted 1−α=0.95 or 1−α=0.9
- Ideally, we would want to find a set T such that ∑x∈Tp(x)=0.9
- For discrete distributions, there may not be a solution, but let us ignore that issue for now (suppose each probability is very small and replace = by ≈ up to an error bounded by the largest individual probability)
- For continuous distributions that will not be an issue
- There is still a problem: there are many solutions! Some of them might appear weird, e.g. include points of probability zero!
- Example: both sets shown in red contain 75% of the mass

High Density Intervals (HDI)
- High Density “Intervals”: pick a solution with the least number of points |T|
- To handle discrete models: argmin{|T|:∑x∈Tp(x)≥0.9}.
- Still not unique in some corner cases, but good enough
- The “interval” in the name is unfortunate, as the set is not actually always an interval
- Example: multimodal distribution
- Sometimes called high density region
Example: the left one below is the HDI because the number of points, |T|, is smaller (5 < 9)

Exercise 1, question 2: implement a function taking as input a posterior PMF (as a vector), a level, and returns an HDI.
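One possible greedy approach, sketched in R (take the highest-probability points until the level is covered; the name hdi and the 0-indexed support are my choices):

hdi <- function(pmf, level = 0.9) {
  ord <- order(pmf, decreasing = TRUE)        # highest-density points first
  nNeeded <- which(cumsum(pmf[ord]) >= level)[1]
  sort(ord[seq_len(nNeeded)] - 1)             # return support points, 0-indexed
}
hdi(c(0, 1/9, 8/9), 0.9)  # returns 1 2: the point with mass 8/9 alone covers only ~0.889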
Prediction

Diagnostic (part 2): given you observe the same data (3 Heads in a row), what is your belief that the next draw is again a Heads?
I ∼ Unif{0, 1, 2, …, K−1, K}
Xm | I ∼ Bern(I/K); m ∈ {1, 2, 3}
Prediction: using posterior probability as the updated belief

Good approach:
- Add a random variable for the quantity you want to predict (next flip N).
- Add one level in the decision tree.
- Use Bayes rule.
- Twist: distinct paths lead to the same prediction
- Sum the probabilities of the paths you collapse (why can we do this?).
- Two ways to get (H1:3, N = H again): γ(1) = 1/48 + 1/3 = 17/48
- Only one way to get (H1:3, N = T): γ(0) = 1/48
- As before, normalize γ to get π: P(N = H again | H1:3) = γ(1) / (γ(0) + γ(1)) = (17/48) / (18/48) = 17/18
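A quick numeric check via the law of total probability (next slide): average the coin biases under the posterior.

p <- c(0, 1, 2) / 2            # coin biases
postCoin <- c(0, 1/9, 8/9)     # posterior over coins after three heads
sum(p * postCoin)              # P(next flip = H | HHH) = 17/18 = 0.9444...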
Prediction example: all cases
Let now π denote a vector where the first component is P(N=Tails|H1:3) and the second, P(N=Heads|H1:3).
- If you see “Heads, Heads, Heads” updated beliefs: π=(1/18,17/18)
- Interpretation: “I predict another Heads with 94% confidence”
- Exercise: if you see “Heads, Tails, Tails” or any other mix of the two: π=(1/2,1/2)
- Interpretation: “I do not lean toward either prediction” (if forced, flip your own coin!)
- By symmetry: if you see “Tails, Tails, Tails”: π=(17/18,1/18)
- Interpretation: “I predict another Tails with 94% confidence”
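A hedged check of the exercise case (one H and two T): only the fair coin has positive posterior mass, so the prediction is 1/2.

p <- c(0, 1, 2) / 2
gamma <- (1/3) * p * (1 - p)^2   # prior times likelihood of the sequence H, T, T
postCoin <- gamma / sum(gamma)   # (0, 1, 0): only the fair coin survives
sum(p * postCoin)                # P(next = H) = 1/2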
Recap: law of total probability
The process of adding the probabilities of the paths is called the law of total probability, mathematically, if Ei is a partition of the sample space, then for any event A, P(A)=∑iP(A,Ei)
Why it matters: this provides a principled way to handle many data analysis issues such as missing data, prediction, censoring, etc. More examples of that later.
Informally: “marginalize” over all the uncertain quantities. In particular, in the Bayesian framework, during prediction, the parameters can be better seen as “nuisance” parameters.
Prediction: “bad approach”
In the prediction exercise, some of you may have used the following argument:
- Compute a point estimate on p.
- Then look at the probability of H and T for a Bernoulli with the point estimate plugged-in
Warning: in general this is a suboptimal way to update beliefs.
But: in this specific example, if you take the point estimate to be the posterior mean, you will find the same answer. This is not true in the vast majority of cases! Here it works because the Bernoulli has a single parameter, its mean. Try it for other models next week; you will see the plug-in method gives a different, sub-optimal answer.
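Even in this example, a different point estimate already breaks the coincidence; a small hedged check, with values from the diagnostic above:

p <- c(0, 1, 2) / 2
postCoin <- c(0, 1/9, 8/9)
p[which.max(postCoin)]   # MAP plug-in: predicts H with probability 1
sum(p * postCoin)        # full posterior predictive: 17/18, strictly less than 1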
Calibrated prediction uncertainty
Question: In what sense is a belief update “optimal”?
Consider the following experiment based on repeated synthetic data generation followed by prediction evaluation
library(dplyr)   # needed below for %>% and mutate
library(ggplot2) # needed below for the plot

set.seed(1)
rbern <- function(n, prob) { return(rbinom(n, 1, prob)) }
rdunif <- function(max) { return(ceiling(max*runif(1))) } # unif 1, 2, .., max
# We will keep track for each synthetic dataset of
# whether we successfully predicted (we know truth because we are using synthetic)
success <- c()
nIters <- 10000
for (iteration in seq_len(nIters)) {
# Data generation
i <- rdunif(3) - 1 # Uniform on {0, 1, 2}
x <- rbern(3, i/2.0)
nextThrow <- rbern(1, i/2.0)
nHeads <- sum(x)
# Record prediction statistics for the case of interest
if (nHeads == 3) {
pred <- 1 # predict H, since P(next = H | 3 heads) = 17/18 > 1/2
success <- c( success, (if (pred == nextThrow) 1 else 0))
}
}
# Plot evolution of the statistics across iterations
df <- data.frame("iteration" = c(1:length(success)), success)
df <- df %>%
mutate(
successRate = cumsum(success) / seq_along(success)
)
ggplot(data=df, aes(x=iteration, y=successRate)) +
geom_hline(yintercept=17/18, linetype="dashed", color = "black") +
geom_line()

Calibration
- This property, “being right 94% of the time when you say you are 94% sure”, is called calibration
- For Bayesian methods, calibration is always satisfied if the model is true
- Things are more complicated if the model is not true (“M-open”), more on that in the exercise.
Exercise 1, question 3:
- Back to the problem of estimation of p: let us first consider a uniform prior with K = 1000, say. Generate 1000 datasets, each of size 10, from the model and use your posterior and HDI functions at the 90% level to perform Bayesian inference. Is the HDI calibrated? What does that mean, exactly?
- Now generate the data from a different prior, say a discretized normal distribution with mean 0.2 and standard deviation 0.2. How does this affect calibration? Do the same for a case where each dataset has size 100 instead, then where each has size 1000.
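One hedged way to build the discretized normal prior (assumption: evaluate the normal density on the grid of biases and renormalize; other discretizations are possible):

K <- 1000
grid <- (0:K) / K                          # the grid of biases i/K
prior <- dnorm(grid, mean = 0.2, sd = 0.2) # normal density evaluated on the grid
prior <- prior / sum(prior)                # renormalize into a PMF over the K + 1 coins
# a coin index can then be sampled with sample(0:K, 1, prob = prior)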
Exercise 1, optional question: apply your posterior inference method to real data. For example, the rocket launch data.
Exercise 1, optional question: a randomized algorithm can be viewed as a function that takes as input a special object which provides access to randomness. Let us assume this object only provides one function, called “bernoulli(p)”, to access iid Bernoullis. Write a function that takes an arbitrary randomized algorithm as input and prints out its decision tree.