STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/25/2019
Goals
Today:
- Sampling wrap-up
- Bayes Estimators
Logistics
- Solutions to exercise 2 posted
- Due today: project abstract (extension possible without penalty, come talk to me)
- Exercise 3 will be released today/tomorrow, due Tuesday April 2nd
- First question: decision theoretic problem (introduced today)
- Second question: implementing the hierarchical model in Blang and Stan (more information will be posted today/tomorrow)
Decision theory: motivation (from Marie’s group research)
- Consider a tracking problem for say a whale
- \(t\): time index of a time series
- \(Z_t\) position of the whale at time \(t\) (unknown, want to estimate)
- \(X_t\) noisy measurements (e.g. from accelerometer sensor)
- Posterior can become bimodal close to the coast line (sketch)
- In such cases, the posterior mean can be on land… that’s a problem if we are tracking a whale!
- MAP has problems too (could “skip” land)
Decision theory: general motivation
Recall: I promised Bayesian statistics provides a “meta-recipe” to attack new statistical problems.
Decision theory: the last ingredient that was missing for the meta-recipe
- Construct a probability model including
- random variables for what we will measure/observe
- random variables for the unknown quantities
- those we are interested in (“parameters”, “predictions”)
- others that just help us formulate the problem (“nuisance”, “random effects”).
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to:
- make prediction (point estimate)
- estimate uncertainty (credible intervals)
- make a decision (e.g. pick website A versus B, fly or ground the rocket, etc)
Setup
What we have so far: joint distribution \(f(z, x)\) where our convention today will be
- \(Z\) is unknown
- \(X\) is observed
New today: a mathematical model for decision-making in the face of uncertainty
- a set of actions \({\mathcal{A}}\) (each action \(a\) could be a prediction, a decision, etc)
- a loss function \(L(a, z)\), which specifies a cost incurred for picking action \(a\) when the true state of the world is \(z\)
Preview of what is coming up. The above formalism will allow us to write the most important equation in Bayesian statistics: the Bayes estimator, \(\delta^*(X)\), defined as
\[\delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\}\]
Examples of actions
- for prediction/point estimation, \({\mathcal{A}}\) is just the state space of one of the random variables in the graphical model
- to predict outcome of next launch, \({\mathcal{A}}= \{\)success, failure\(\}\)
- to predict failure probability \({\mathcal{A}}= [0, 1]\)
- sometimes, we want to predict all the unknown variables, \({\mathcal{A}}= \{z \in {\mathcal{Z}}\}\)
- for credible regions, \({\mathcal{A}}\) is the set of subsets of the state space of one of the random variables
- for a single interval on a real random variable, \({\mathcal{A}}= \{[c, d] : c < d\}\)
- but actions are more general
- Clinical setting: which treatment/diagnostic to do next?
- A/B testing example: use version A or B?
- but we saw last time it was possible to “augment” the graphical model so that this choice becomes a variable in the augmented model
Examples of loss functions
- for real numbers \(a, z\), \(L(a, z) = (a - z)^2\), the squared loss
- for binary predictions, \(L(a, z) = {{\bf 1}}[a \neq z]\)
A more interesting example: clustering analysis
- In clustering analysis, the objective is to find a partition \(\rho\) which divides the indices of the datapoints into “blocks” or “clusters”.
- The points in the first cluster are explained using a first parametric model, the points in the second cluster are explained using a second parametric model, etc.
- Here we will focus on a loss function for clustering rather than the probabilistic model. We will come back to the probabilistic model for clustering next week.
- For simplicity, let us consider today clustering into two clusters
Popular loss function for clustering: the rand loss.
- True partition \({\rho}\)
- Guessed partition \({{\rho'}}\)
- The rand loss function is denoted \({\textrm{rand}}({\rho}, {{\rho'}})\).
- Note that we turn the standard notion of rand index into a loss by taking 1 - the rand index.
Definition: The rand loss is defined as the number of (unordered) pairs of data points indices \(\{i,j\}\) such that \((i \sim_{{\rho}} j) \neq (i \sim_{{{\rho'}}} j)\), i.e.:
\[
\sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)],
\] where \((i \sim_\rho j) = 1\) if there is a \(B\in\rho\) s.t. \(\{i,j\}\subseteq B\), and \((i \sim_\rho j) = 0\) otherwise.
In other words, a loss of one is incurred each time either: (1) two points are assigned to the same cluster when they belong in separate clusters, or (2) two points are assigned to different clusters when they should be in the same cluster.

NB: The rand loss has several problems, motivating other clustering losses such as the adjusted rand index, but we will look at the rand loss here since the derivation of the Bayes estimator is easy for that particular loss. (See Fritsch and Ickstadt, 2009 for discussion on other losses and how to approach the Bayes estimator optimization problem for these other losses.)
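The rand loss can be computed directly from its definition by looping over all unordered pairs. A minimal Python sketch (the example partitions and all names here are illustrative, not from the lecture):

```python
from itertools import combinations

def same_block(partition, i, j):
    """Return True if indices i and j fall in the same block of the partition."""
    return any(i in block and j in block for block in partition)

def rand_loss(rho, rho_prime, n):
    """Number of index pairs {i, j} on which the two partitions disagree
    about co-clustering, i.e. sum of 1[(i ~_rho j) != (i ~_rho' j)]."""
    return sum(
        same_block(rho, i, j) != same_block(rho_prime, i, j)
        for i, j in combinations(range(n), 2)
    )

truth = [{0, 1, 2}, {3, 4}]   # made-up "true" partition of 5 points
guess = [{0, 1}, {2, 3, 4}]   # made-up guessed partition
print(rand_loss(truth, guess, 5))  # 4: pairs (0,2), (1,2), (2,3), (2,4) disagree
```

Note that the loss is symmetric in its two arguments and is zero exactly when the two partitions agree on every pair.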
Estimators
- We want to devise a decision-making strategy, which we formalize as an estimator:
- a function that takes as input only the observations and outputs a proposed action, \(\delta(x) \in {\mathcal{A}}\).
- i.e., \(\delta : {\mathscr{X}}\to {\mathcal{A}}\)
- We want this estimator to be as “good” as possible.
- Under a certain criterion of goodness, we will see that the Bayesian framework provides a principled and systematic way of specifying a “best” estimator.
Evaluation of estimators
Frequentist risk: view \(\theta = z\) as parameters for a likelihood / indexing probabilities over observables \(\{{\mathbb{P}}_z\}\), with a corresponding collection of expectation operators \(\{{\mathbf{E}}_z\}\), \[
\begin{aligned}
R(z, \delta) &= {\mathbf{E}}_z[L(\delta(X), z)] \\
&= \int L(\delta(x), z)\ \text{likelihood}(x | z) {\text{d}}x
\end{aligned}
\]
Bayesian notion: integrated risk \[
\begin{aligned}
r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \\
&= \int \int L(\delta(x), z)\ \text{prior}(z)\ \text{likelihood}(x | z)\ {\text{d}}x {\text{d}}z
\end{aligned}
\]
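To make the two notions concrete, here is a toy Python check under an assumed model: a single observation \(X \sim \text{Bern}(z)\), squared loss, the naive estimator \(\delta(x) = x\), and a made-up two-point prior. All specifics are illustrative, not from the lecture:

```python
def freq_risk(z, delta):
    """Frequentist risk R(z, delta) = E_z[(delta(X) - z)^2],
    averaging over the two outcomes X in {0, 1} of a Bernoulli(z)."""
    return (1 - z) * (delta(0) - z) ** 2 + z * (delta(1) - z) ** 2

def integrated_risk(prior, delta):
    """Integrated risk r(delta) = E[R(Z, delta)] under a discrete prior on z."""
    return sum(p * freq_risk(z, delta) for z, p in prior.items())

delta = lambda x: x                   # naive estimator: report the observation
prior = {0.25: 0.5, 0.75: 0.5}        # illustrative two-point prior on z

print(freq_risk(0.25, delta))         # 0.25 * 0.75 = 0.1875
print(integrated_risk(prior, delta))  # 0.1875 (same risk at both prior points)
```

Here \(R(z, \delta) = z(1-z)\) depends on \(z\) in general (a function, hence only a partial order on estimators), while \(r(\delta)\) is a single number (a complete order).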
Key difference:
- Frequentist risk: a partial order on estimators
- Only canonical notion of optimality is then non-dominance, called admissibility
- Bayes risk: a complete order on estimators (under weak conditions)
- Can actually get an expression for that optimal estimator
- As a bonus, also satisfies the frequentist notion of non-sub-optimality: Bayes estimators are admissible under weak conditions (more on this later)
The Bayes estimator
So far: abstract definition of Bayes estimators as minimizers of the integrated risk \[
\begin{aligned}
\delta^* &= {\textrm{argmin}}_{\delta : {\mathscr{X}}\to {\mathcal{A}}} \{ r(\delta) \} \\
r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)]
\end{aligned}
\]
More explicit expression: The estimator \(\delta^*\), defined by the equation below, minimizes the integrated risk
\[
\delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\}
\]
This estimator \(\delta^*\) is called a Bayes estimator.
This means that given a model and a goal, the Bayesian framework provides in principle a recipe for constructing an estimator.
However, the computation required to implement this recipe may be considerable. This explains why computational statistics plays a large role in Bayesian statistics and in this course.
Examples: from a loss to a Bayes estimator
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[Z \neq a] | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}(Z \neq a | X) : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ 1 - {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\
&= {\textrm{argmax}}\{ {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\
&= \text{MAP estimator}
\end{aligned}
\]
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbf{E}}[(Z - a)^2 | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbf{E}}[Z^2 | X] - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\}
\end{aligned}
\] Now think of \({\mathbf{E}}[Z|X]\) as a constant that you get from the posterior. To minimize the bottom expression, take derivative with respect to \(a\), equate to zero:
\[
\begin{aligned}
-2 {\mathbf{E}}[Z|X] + 2a = 0
\end{aligned}
\] Hence: here the Bayes estimator is the posterior mean, \(\delta^*(X) = {\mathbf{E}}[Z|X]\).
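This can be checked numerically: minimize a Monte Carlo estimate of \({\mathbf{E}}[(Z-a)^2|X]\) over a grid and compare with the sample mean. The posterior samples below are a stand-in (a normal posterior, purely illustrative):

```python
import random

random.seed(0)
# stand-in posterior samples (normal posterior, purely illustrative)
samples = [random.gauss(2.0, 1.0) for _ in range(10_000)]

def expected_sq_loss(a):
    """Monte Carlo estimate of E[(Z - a)^2 | X]."""
    return sum((z - a) ** 2 for z in samples) / len(samples)

# grid search for the minimizer of the Monte Carlo objective
grid = [i / 100 for i in range(0, 401)]
best = min(grid, key=expected_sq_loss)

post_mean = sum(samples) / len(samples)
# the grid minimizer sits at the grid point nearest the sample mean
assert abs(best - post_mean) < 0.01
```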
- credible interval from the Bayes estimator: let us do it for a HDI. Recall, \({\mathcal{A}}= \{[c, d] : c < d\}\), and consider the loss function given by \[
L([c, d], z) = {{\bf 1}}\{z \notin [c, d]\} + k (d - c)
\] for some tuning parameter \(k\). We get:
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}[Z \notin [c, d] | X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}[Z < c|X] + {\mathbb{P}}[Z > d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}[Z \le c|X] - {\mathbb{P}}[Z \le d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\}
\end{aligned}
\] In the last step we used \({\mathbb{P}}[Z > d|X] = 1 - {\mathbb{P}}[Z \le d|X]\) and dropped the constant 1, which does not change the argmin; we also assumed the posterior has a continuous density \(f\) so that \(<\) can be replaced by \(\le\). Again we take the derivative with respect to \(c\) and set it to zero; then we will do the same thing for \(d\). Notice that \({\mathbb{P}}[Z \le c|X]\) is the posterior CDF, so taking the derivative with respect to \(c\) yields a density:
\[
f_{Z|X}(c) - k = 0,
\]
so we see the optimum will be the largest \([c, d]\) such that \(f(c) = f(d) = k\). Tuning \(k\) gives us different HDI levels.
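For a unimodal posterior, the threshold characterization can even be solved in closed form. A sketch assuming a normal posterior density (an illustrative choice, not from the lecture):

```python
import math

def normal_pdf(z, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma^2) at z."""
    return math.exp(-((z - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

def hdi_from_threshold(k, mu=0.0, sigma=1.0):
    """Largest interval [c, d] with f(c) = f(d) = k for a normal density:
    solving normal_pdf(z) = k gives z = mu +/- sigma * sqrt(-2 log(k sigma sqrt(2 pi)))."""
    peak = normal_pdf(mu, mu, sigma)
    if k >= peak:
        raise ValueError("threshold at or above the mode: empty interval")
    half_width = sigma * math.sqrt(-2 * math.log(k * sigma * math.sqrt(2 * math.pi)))
    return mu - half_width, mu + half_width

c, d = hdi_from_threshold(k=0.1)
# by construction the density equals the threshold k at both endpoints
assert abs(normal_pdf(c) - 0.1) < 1e-9 and abs(normal_pdf(d) - 0.1) < 1e-9
```

Lowering \(k\) widens the interval (more posterior mass captured), matching the role of \(k\) as the tuning parameter for the HDI level.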
- Exercise question: pessimistic prediction
- Goal: estimate a success probability \(p\)
- But we want to heavily penalize making a prediction \(\hat p\) greater than \(p\), even if it is just a little bit
- Similar situation: “the price is right”
- Write a loss
- Derive a Bayes estimator
Example: clustering, continued
Goal: compute the Bayes estimator derived from the rand loss, \[
\sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)],
\]
First, we can write:
\[
\begin{aligned}
{\textrm{argmin}}_{\textrm{partition }\rho'} {\mathbf{E}}\left[{\textrm{rand}}(\rho, \rho')|X\right] &=
{\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} {\mathbf{E}}\left[{{\bf 1}}\left[\rho_{ij} \neq \rho'_{ij}\right]|X\right] \\
&= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} \left\{(1-\rho'_{ij}){\mathbb{P}}(\rho_{ij} = 1|X) + \rho'_{ij} \left(1- {\mathbb{P}}(\rho_{ij} = 1 |X)\right)\right\}
\end{aligned}
\]
where \(\rho_{i,j} = (i \sim_{\rho} j)\), which can be viewed as edge indicators on a graph.
The above identity comes from the fact that \(\rho_{i,j}\) is either one or zero, so:
- the first term in the the brackets of the above equation corresponds to the edges not in the partition \(\rho\) (for which we are penalized if the posterior probability of the edge is large), and
- the second term in the same brackets corresponds to the edges in the partition \(\rho\) (for which we are penalized if the posterior probability of the edge is small).
This means that computing an optimal bipartition of the data into two clusters can be done in two steps:
- Simulate a Markov chain, and use the samples to estimate \({s}_{i,j} = {\mathbb{P}}(\rho_{ij} = 1 | X)\) via Monte Carlo averages.
- Minimize the linear objective function \(\sum_{i<j} \left\{(1-\rho'_{ij}){s}_{i,j} + \rho'_{ij} \left(1- {s}_{i,j}\right)\right\}\) over bipartitions \(\rho'\).
Note that the second step can be efficiently computed using min-cut/max-flow algorithms (understanding how these algorithms work is outside the scope of this lecture, but if you are curious, see CLRS, chapter 26).
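For small \(n\), the second step can also be brute-forced, which makes the estimator easy to check. A sketch with made-up co-clustering probabilities \(s_{i,j}\) (in practice these come from the Monte Carlo averages of step 1):

```python
from itertools import combinations, product

# Illustrative posterior co-clustering probabilities s[(i, j)] = P(rho_ij = 1 | X)
# for n = 4 points (made-up numbers, not MCMC output)
n = 4
s = {(0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.2,
     (1, 2): 0.15, (1, 3): 0.1, (2, 3): 0.85}

def objective(labels):
    """Linear rand-loss objective for a bipartition given as 0/1 labels."""
    total = 0.0
    for i, j in combinations(range(n), 2):
        same = int(labels[i] == labels[j])
        total += (1 - same) * s[(i, j)] + same * (1 - s[(i, j)])
    return total

# brute force over all 2^n labelings (fine for small n;
# use min-cut/max-flow at scale, as noted above)
best = min(product([0, 1], repeat=n), key=objective)
print(best)  # groups {0, 1} and {2, 3} together, matching the large s values
```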
Black box optimization
- Objective function from \(M\) Monte Carlo samples:
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in {\mathcal{A}}\} \\
\end{aligned}
\]
Idea that could be part of a project: stochastic gradient meets Bayes estimators
If \({\mathcal{A}}\) is tricky to explore (combinatorial, constrained such as in the motivating tracking problem, etc) and \({\mathcal{A}}= \{z \in {\mathcal{Z}}\}\), we can further approximate both the objective and the constraints as follows:
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in \{Z_1, Z_2, \dots, Z_M\} \} \\
\end{aligned}
\]
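A sketch of this sample-restricted Bayes estimator, with stand-in posterior samples (normal, purely illustrative) and squared loss; note it only needs loss evaluations, never an enumeration of \({\mathcal{A}}\):

```python
import random

random.seed(1)
M = 2_000
# stand-in posterior samples for Z (illustrative: normal posterior)
zs = [random.gauss(0.0, 1.0) for _ in range(M)]

def loss(a, z):
    """Any loss works here; squared loss for concreteness."""
    return (a - z) ** 2

def mc_objective(a):
    """Monte Carlo approximation of E[L(a, Z) | X]."""
    return sum(loss(a, z) for z in zs) / M

# restrict the action space to the samples themselves: each candidate action
# is automatically feasible (it is a posterior sample), so constraints are
# handled for free
best = min(zs, key=mc_objective)
```

Under squared loss the winner is the sample closest to the posterior mean, so this is a crude but constraint-respecting stand-in for the exact Bayes estimator.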
- Idea that could be part of a project: Bayes estimator for feature matrices
Bayes estimators from a frequentist perspective
Recall: admissibility, a frequentist notion of optimality (or rather, non-sub-optimality).
- An estimator \(\delta\) is admissible if there is no dominating estimator \(\delta'\)
- Domination here is under the frequentist risk \(R(z, \delta) = {\mathbf{E}}_z[L(\delta(X), z)]\),
- i.e. \(\delta\) is admissible if there is no \(\delta'\) such that \(R(z, \delta') \le R(z, \delta)\) for all \(z\), with strict inequality for at least one \(z\)
Proposition: if a Bayes estimator is unique, it is admissible.
To show uniqueness, may try to use convexity of loss function for example.
Discussion on last week example: change point problem / segmentation
- In a healthy cell, chromosomes (long strands of DNA) come in pairs (except for the sex chromosomes).
- In certain cancers, some parts of the chromosome come to lose one of the two copies.
- Suppose we segment one (non-sex) chromosome into segments \(0, 1, 2, \dots, 49\).
- For each segment, let \(S_i\) encode the number of copies in this segment.
- Examples of possible realizations for the copy numbers:

Change point problem
- Goal of inference: copy numbers, which are hard to observe directly.
- Instead, use sequencing data.
- DNA of many cells is pooled and cut into small pieces
- sequencing machines read some of these small pieces and map them to one of the 50 segments.
- Key point: if a segment has two copies, there should be on average twice as many reads mapped into it compared to a segment with only one copy.
- Let \(K\) be the average number of reads we get per segment and per chromosome copy.
- Could be determined experimentally or inferred; let us say we know \(K=3.5\).
Data looks like this:

Can you build a Bayesian model for that?
How to pick distributions?
- Use the type of the realization to narrow down the possibilities
- Example: for the Poisson emissions, we knew the support had to be \(\{0, 1, 2, 3, \dots\}\); this excludes Bernoulli and Uniform.
- Exploratory data analysis
- Ask experts/check literature if possible
- Theoretical grounds (asymptotic theory, mechanistic models, etc)
- Example: “law of rare events” provides some interpretation
- Try several methods and use Bayesian model selection techniques (next week, if time permits)
Frequently used distributions
By type:
- Binary data types
- Bernoulli (by definition)
- Enumeration data types
- Categorical (by definition)
- Integers \(\ge 0\)
- Geometric (mode at zero)
- Poisson (mode close to mean)
- For more detailed models, Negative Binomial (2 parameters)
- Continuous \(\ge 0\):
- Continuous in \([0, 1]\):
- Continuous Uniform
- For more detailed models, Beta
- Continuous on real line:
By interpretation:
- Sum of Bernoullis:
- Population survey:
- Waiting times:
- Geometric, Negative Binomial
Mathematical model
- Let \(S_0, S_1, \dots, S_{49}\) denote a list of random variables modelling the unobserved states.
- Each state takes one of two possible values (the “copy number state”).
- We want to encourage a priori the states to “stick” to the same value (i.e. the copy number does not change too often as we move along the genome), so we set \[P(S_i = s | S_{i-1} = s) = 0.95\]
- Exercise (part 1): how to write this in our mathematical notation, \(S_i | S_{i-1} \sim ???\) (hint: use Bernoullis and change the encoding of the states as 0 and 1 instead of 1 and 2)
- Exercise (part 2): write a JAGS model (hint: check “ifelse(x, a, b)” in the JAGS manual)
- Also have a look at the Blang model for comparison
- Let \(Y_0, Y_1, \dots, Y_{49}\) denote the observations (number of reads mapped for each of the 50 segments).
- We will use the following model (a common choice in bioinformatics): \[Y_i | S_i \sim \text{Poisson}(3.5 (S_i + 1))\]
- Recall: Poisson distribution
- Exercise (part 3): draw graphical model
- Recall \(K\) is the average number of reads we get per segment and per chromosome copy.
- How to proceed if \(K\) is not known?
- Part 4: Implement this idea in Blang as follows:
- In the variable declarations, make \(K\) random instead of param, and initialize it with ?: latentReal instead of fixedReal(3.5)
- Add a prior to \(K\) (see distributions reference)
- Graphical model of the new model?
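The generative model described above (a sticky binary chain plus Poisson emissions) can be forward-simulated in a few lines. This is a sketch for building intuition, not a substitute for the JAGS/Blang exercises; the inversion sampler for the Poisson is a standard textbook method, and all variable names here are illustrative:

```python
import math
import random

random.seed(2)
K = 3.5        # average reads per segment per chromosome copy (given as known)
n = 50         # number of segments
STICK = 0.95   # P(S_i = s | S_{i-1} = s): sticky transitions

# forward-simulate the sticky chain over copy-number states {0, 1}
# (encoding: state s means s + 1 chromosome copies)
s = [random.randint(0, 1)]
for _ in range(1, n):
    s.append(s[-1] if random.random() < STICK else 1 - s[-1])

def poisson_sample(rate):
    """Poisson sample via CDF inversion (no external libraries needed)."""
    u, k, p = random.random(), 0, math.exp(-rate)
    cdf = p
    while u > cdf:
        k += 1
        p *= rate / k
        cdf += p
    return k

# emissions: Y_i | S_i ~ Poisson(K * (S_i + 1))
y = [poisson_sample(K * (si + 1)) for si in s]
```

Plotting `y` against the segment index reproduces the shape of the data shown in the lecture: stretches around mean 3.5 (one copy) alternating with stretches around mean 7 (two copies).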