STAT 520 - Bayesian Analysis
Alexandre Bouchard-Côté
3/25/2019
Goals
Today:
- Sampling wrap-up
- Bayes Estimators
Logistics
- Solutions to exercise 2 posted
- Due today: project abstract (extension possible without penalty, come talk to me)
- Exercise 3 will be released today/tomorrow, due Tuesday April 2nd
- First question: decision theoretic problem (introduced today)
- Second question: implementing the hierarchical model in Blang and Stan (more information will be posted today/tomorrow)
Decision theory: motivation (from Marie’s group research)
- Consider a tracking problem for say a whale
- \(t\): time index of a time series
- \(Z_t\) position of the whale at time \(t\) (unknown, want to estimate)
- \(X_t\) noisy measurements (e.g. from accelerometer sensor)
- Posterior can become bimodal close to the coast line (sketch)
- In such cases, the posterior mean can be on land… that’s a problem if we are tracking a whale!
- MAP has problems too (could “skip” land)
Decision theory: general motivation
Recall: I promised Bayesian statistics provides a “meta-recipe” to attack new statistical problems.
Decision theory: the last ingredient that was missing for the meta-recipe
- Construct a probability model including
- random variables for what we will measure/observe
- random variables for the unknown quantities
- those we are interested in (“parameters”, “predictions”)
- others that just help us formulate the problem (“nuisance”, “random effects”).
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to:
- make prediction (point estimate)
- estimate uncertainty (credible intervals)
- make a decision (e.g. pick website A versus B, fly or ground the rocket, etc)
Setup
What we have so far: joint distribution \(f(z, x)\) where our convention today will be
- \(Z\) is unknown
- \(X\) is observed
New today: a mathematical model for decision-making in the face of uncertainty
- a set of actions \({\mathcal{A}}\) (each action \(a\) could be a prediction, a decision, etc)
- a loss function \(L(a, z)\), which specifies a cost incurred for picking action \(a\) when the true state of the world is \(z\)
Preview of what is coming up. The above formalism will allow us to write the most important equation in Bayesian statistics: the Bayes estimator, \(\delta^*(X)\), defined as
\[\delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\}\]
Examples of actions
- for prediction/point estimation, \({\mathcal{A}}\) is just the state space of one of the random variables in the graphical model
- to predict outcome of next launch, \({\mathcal{A}}= \{\)success, failure\(\}\)
- to predict failure probability \({\mathcal{A}}= [0, 1]\)
- sometimes, we want to predict all the unknown variables, \({\mathcal{A}}= \{z \in {\mathcal{Z}}\}\)
- for credible regions, \({\mathcal{A}}\) is the set of subsets of the state space of one of the random variables
- for a single interval on a real random variable, \({\mathcal{A}}= \{[c, d] : c < d\}\)
- but actions are more general
- Clinical setting: which treatment/diagnostic to do next?
- A/B testing example: use version A or B?
- but we saw last time it was possible to “augment” the graphical model so that this choice becomes a variable in the augmented model
Examples of loss functions
- for real numbers \(a, z\), \(L(a, z) = (a - z)^2\), the squared loss
- for binary predictions, \(L(a, z) = {{\bf 1}}[a \neq z]\)
A more interesting example: clustering analysis
- In clustering analysis, the objective is to find a partition \(\rho\) which divides the indices of the datapoints into “blocks” or “clusters”.
- The points in the first cluster are explained using a first parametric model, the points in the second cluster are explained using a second parametric model, etc.
- Here we will focus on a loss function for clustering rather than the probabilistic model. We will come back to the probabilistic model for clustering next week.
- For simplicity, let us consider today clustering into two clusters
Popular loss function for clustering: the rand loss.
- True partition \({\rho}\)
- Guessed partition \({{\rho'}}\)
- The rand loss function is denoted \({\textrm{rand}}({\rho}, {{\rho'}})\).
- Note that we turn the standard notion of rand index into a loss by taking 1 - the rand index.
Definition: The rand loss is defined as the number of (unordered) pairs of data points indices \(\{i,j\}\) such that \((i \sim_{{\rho}} j) \neq (i \sim_{{{\rho'}}} j)\), i.e.:
\[
\sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)],
\] where \((i \sim_\rho j) = 1\) if there is a \(B\in\rho\) s.t. \(\{i,j\}\subseteq B\), and \((i \sim_\rho j) = 0\) otherwise.
In other words, a loss of one is incurred each time either: (1) two points are assigned to the same cluster when they belong in separate clusters, or (2) two points are assigned to different clusters when they should be in the same cluster.

NB: The rand loss has several problems, motivating other clustering losses such as the adjusted rand index, but we will look at the rand loss here since the derivation of the Bayes estimator is easy for that particular loss. (See Fritsch and Ickstadt, 2009 for discussion on other losses and how to approach the Bayes estimator optimization problem for these other losses.)
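The rand loss can be computed directly from its definition by looping over all unordered pairs. A minimal Python sketch (the example partitions and all names here are illustrative, not from the lecture):

```python
from itertools import combinations

def same_block(partition, i, j):
    """Return True if indices i and j fall in the same block of the partition."""
    return any(i in block and j in block for block in partition)

def rand_loss(rho, rho_prime, n):
    """Number of index pairs {i, j} on which the two partitions disagree
    about co-clustering, i.e. sum of 1[(i ~_rho j) != (i ~_rho' j)]."""
    return sum(
        same_block(rho, i, j) != same_block(rho_prime, i, j)
        for i, j in combinations(range(n), 2)
    )

truth = [{0, 1, 2}, {3, 4}]   # made-up "true" partition of 5 points
guess = [{0, 1}, {2, 3, 4}]   # made-up guessed partition
print(rand_loss(truth, guess, 5))  # 4: pairs (0,2), (1,2), (2,3), (2,4) disagree
```

Note that the loss is symmetric in its two arguments and is zero exactly when the two partitions agree on every pair.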
Estimators
- We want to devise a decision-making strategy, which we formalize as an estimator:
- a function that takes as input only the observations and outputs a proposed action, \(\delta(x) \in {\mathcal{A}}\).
- i.e., \(\delta : {\mathscr{X}}\to {\mathcal{A}}\)
- We want this estimator to be as “good” as possible.
- Under a certain criterion of goodness, we will see that the Bayesian framework provides a principled and systematic way of specifying a “best” estimator.
Evaluation of estimators
Frequentist risk: view \(\theta = z\) as parameters for a likelihood / indexing probabilities over observables \(\{{\mathbb{P}}_z\}\), with a corresponding collection of expectation operators \(\{{\mathbf{E}}_z\}\), \[
\begin{aligned}
R(z, \delta) &= {\mathbf{E}}_z[L(\delta(X), z)] \\
&= \int L(\delta(x), z)\ \text{likelihood}(x | z) {\text{d}}x
\end{aligned}
\]
Bayesian notion: integrated risk \[
\begin{aligned}
r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \\
&= \int \int L(\delta(x), z)\ \text{prior}(z)\ \text{likelihood}(x | z)\ {\text{d}}x {\text{d}}z
\end{aligned}
\]
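To make the two notions concrete, here is a toy Python check under an assumed model: a single observation \(X \sim \text{Bern}(z)\), squared loss, the naive estimator \(\delta(x) = x\), and a made-up two-point prior. All specifics are illustrative, not from the lecture:

```python
def freq_risk(z, delta):
    """Frequentist risk R(z, delta) = E_z[(delta(X) - z)^2],
    averaging over the two outcomes X in {0, 1} of a Bernoulli(z)."""
    return (1 - z) * (delta(0) - z) ** 2 + z * (delta(1) - z) ** 2

def integrated_risk(prior, delta):
    """Integrated risk r(delta) = E[R(Z, delta)] under a discrete prior on z."""
    return sum(p * freq_risk(z, delta) for z, p in prior.items())

delta = lambda x: x                   # naive estimator: report the observation
prior = {0.25: 0.5, 0.75: 0.5}        # illustrative two-point prior on z

print(freq_risk(0.25, delta))         # 0.25 * 0.75 = 0.1875
print(integrated_risk(prior, delta))  # 0.1875 (same risk at both prior points)
```

Here \(R(z, \delta) = z(1-z)\) depends on \(z\) in general (a function, hence only a partial order on estimators), while \(r(\delta)\) is a single number (a complete order).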
Key difference:
- Frequentist risk: a partial order on estimators
- Only canonical notion of optimality is then non-dominance, called admissibility
- Bayes risk: a complete order on estimators (under weak conditions)
- Can actually get an expression for that optimal estimator
- As a bonus, also satisfies the frequentist notion of non-sub-optimality: Bayes estimators are admissible under weak conditions (more on this later)
The Bayes estimator
So far: abstract definition of Bayes estimators as minimizers of the integrated risk \[
\begin{aligned}
\delta^* &= {\textrm{argmin}}_{\delta : {\mathscr{X}}\to {\mathcal{A}}} \{ r(\delta) \} \\
r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)]
\end{aligned}
\]
More explicit expression: The estimator \(\delta^*\), defined by the equation below, minimizes the integrated risk
\[
\delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\}
\]
This estimator \(\delta^*\) is called a Bayes estimator.
This means that given a model and a goal, the Bayesian framework provides in principle a recipe for constructing an estimator.
However, the computation required to implement this recipe may be considerable. This explains why computational statistics plays a large role in Bayesian statistics and in this course.
Examples: from a loss to a Bayes estimator
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[Z \neq a] | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}(Z \neq a | X) : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ 1 - {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\
&= {\textrm{argmax}}\{ {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\
&= \text{MAP estimator}
\end{aligned}
\]
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbf{E}}[(Z - a)^2 | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbf{E}}[Z^2 | X] - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\}
\end{aligned}
\] Now think of \({\mathbf{E}}[Z|X]\) as a constant that you get from the posterior. To minimize the bottom expression, take derivative with respect to \(a\), equate to zero:
\[
\begin{aligned}
-2 {\mathbf{E}}[Z|X] + 2a = 0
\end{aligned}
\] Hence: here the Bayes estimator is the posterior mean, \(\delta^*(X) = {\mathbf{E}}[Z|X]\).
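This can be checked numerically: minimize a Monte Carlo estimate of \({\mathbf{E}}[(Z-a)^2|X]\) over a grid and compare with the sample mean. The posterior samples below are a stand-in (a normal posterior, purely illustrative):

```python
import random

random.seed(0)
# stand-in posterior samples (normal posterior, purely illustrative)
samples = [random.gauss(2.0, 1.0) for _ in range(10_000)]

def expected_sq_loss(a):
    """Monte Carlo estimate of E[(Z - a)^2 | X]."""
    return sum((z - a) ** 2 for z in samples) / len(samples)

# grid search for the minimizer of the Monte Carlo objective
grid = [i / 100 for i in range(0, 401)]
best = min(grid, key=expected_sq_loss)

post_mean = sum(samples) / len(samples)
# the grid minimizer sits at the grid point nearest the sample mean
assert abs(best - post_mean) < 0.01
```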
- credible interval from the Bayes estimator: let us do it for a HDI. Recall, \({\mathcal{A}}= \{[c, d] : c < d\}\), and consider the loss function given by \[
L([c, d], z) = {{\bf 1}}\{z \notin [c, d]\} + k (d - c)
\] for some tuning parameter \(k\). We get:
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}[Z \notin [c, d] | X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}[Z < c|X] + {\mathbb{P}}[Z > d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\
&= {\textrm{argmin}}\{ {\mathbb{P}}[Z \le c|X] - {\mathbb{P}}[Z \le d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\}
\end{aligned}
\] In the last step we used \({\mathbb{P}}[Z > d|X] = 1 - {\mathbb{P}}[Z \le d|X]\) and dropped the constant 1, which does not change the argmin; we also assumed the posterior has a continuous density \(f\) so that \(<\) can be replaced by \(\le\). Again we take the derivative with respect to \(c\) and set it to zero; then we will do the same thing for \(d\). Notice that \({\mathbb{P}}[Z \le c|X]\) is the posterior CDF, so taking the derivative with respect to \(c\) yields a density:
\[
f_{Z|X}(c) - k = 0,
\]
so we see the optimum will be the largest \([c, d]\) such that \(f(c) = f(d) = k\). Tuning \(k\) gives us different HDI levels.
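For a unimodal posterior, the threshold characterization can even be solved in closed form. A sketch assuming a normal posterior density (an illustrative choice, not from the lecture):

```python
import math

def normal_pdf(z, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma^2) at z."""
    return math.exp(-((z - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

def hdi_from_threshold(k, mu=0.0, sigma=1.0):
    """Largest interval [c, d] with f(c) = f(d) = k for a normal density:
    solving normal_pdf(z) = k gives z = mu +/- sigma * sqrt(-2 log(k sigma sqrt(2 pi)))."""
    peak = normal_pdf(mu, mu, sigma)
    if k >= peak:
        raise ValueError("threshold at or above the mode: empty interval")
    half_width = sigma * math.sqrt(-2 * math.log(k * sigma * math.sqrt(2 * math.pi)))
    return mu - half_width, mu + half_width

c, d = hdi_from_threshold(k=0.1)
# by construction the density equals the threshold k at both endpoints
assert abs(normal_pdf(c) - 0.1) < 1e-9 and abs(normal_pdf(d) - 0.1) < 1e-9
```

Lowering \(k\) widens the interval (more posterior mass captured), matching the role of \(k\) as the tuning parameter for the HDI level.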
- Exercise question: pessimistic prediction
- Goal: estimate a success probability \(p\)
- But we want to heavily penalize making a prediction \(\hat p\) greater than \(p\), even if it is just a little bit
- Similar situation: “the price is right”
- Write a loss
- Derive a Bayes estimator
Example: clustering, continued
Goal: compute the Bayes estimator derived from the rand loss, \[
\sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)],
\]
First, we can write:
\[
\begin{aligned}
{\textrm{argmin}}_{\textrm{partition }\rho'} {\mathbf{E}}\left[{\textrm{rand}}(\rho, \rho')|X\right] &=
{\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} {\mathbf{E}}\left[{{\bf 1}}\left[\rho_{ij} \neq \rho'_{ij}\right]|X\right] \\
&= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} \left\{(1-\rho'_{ij}){\mathbb{P}}(\rho_{ij} = 1|X) + \rho'_{ij} \left(1- {\mathbb{P}}(\rho_{ij} = 1 |X)\right)\right\}
\end{aligned}
\]
where \(\rho_{i,j} = (i \sim_{\rho} j)\), which can be viewed as edge indicators on a graph.
The above identity comes from the fact that \(\rho_{i,j}\) is either one or zero, so:
- the first term in the the brackets of the above equation corresponds to the edges not in the partition \(\rho\) (for which we are penalized if the posterior probability of the edge is large), and
- the second term in the same brackets corresponds to the edges in the partition \(\rho\) (for which we are penalized if the posterior probability of the edge is small).
This means that computing an optimal bipartition of the data into two clusters can be done in two steps:
- Simulate a Markov chain, and use the samples to estimate \({s}_{i,j} = {\mathbb{P}}(\rho_{ij} = 1 | X)\) via Monte Carlo averages.
- Minimize the linear objective function \(\sum_{i<j} \left\{(1-\rho'_{ij}){s}_{i,j} + \rho'_{ij} \left(1- {s}_{i,j}\right)\right\}\) over bipartitions \(\rho'\).
Note that the second step can be efficiently computed using min-cut/max-flow algorithms (understanding how these algorithms work is outside the scope of this lecture, but if you are curious, see CLRS, chapter 26).
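For small \(n\), the second step can also be brute-forced, which makes the estimator easy to check. A sketch with made-up co-clustering probabilities \(s_{i,j}\) (in practice these come from the Monte Carlo averages of step 1):

```python
from itertools import combinations, product

# Illustrative posterior co-clustering probabilities s[(i, j)] = P(rho_ij = 1 | X)
# for n = 4 points (made-up numbers, not MCMC output)
n = 4
s = {(0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.2,
     (1, 2): 0.15, (1, 3): 0.1, (2, 3): 0.85}

def objective(labels):
    """Linear rand-loss objective for a bipartition given as 0/1 labels."""
    total = 0.0
    for i, j in combinations(range(n), 2):
        same = int(labels[i] == labels[j])
        total += (1 - same) * s[(i, j)] + same * (1 - s[(i, j)])
    return total

# brute force over all 2^n labelings (fine for small n;
# use min-cut/max-flow at scale, as noted above)
best = min(product([0, 1], repeat=n), key=objective)
print(best)  # groups {0, 1} and {2, 3} together, matching the large s values
```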
Black box optimization
- Objective function from \(M\) Monte Carlo samples:
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in {\mathcal{A}}\} \\
\end{aligned}
\]
Idea that could be part of a project: stochastic gradient meets Bayes estimators
If \({\mathcal{A}}\) is tricky to explore (combinatorial, constrained such as in the motivating tracking problem, etc) and \({\mathcal{A}}= \{z \in {\mathcal{Z}}\}\), we can further approximate both the objective and the constraints as follows:
\[
\begin{aligned}
\delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\
&\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in \{Z_1, Z_2, \dots, Z_M\} \} \\
\end{aligned}
\]
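A sketch of this sample-restricted Bayes estimator, with stand-in posterior samples (normal, purely illustrative) and squared loss; note it only needs loss evaluations, never an enumeration of \({\mathcal{A}}\):

```python
import random

random.seed(1)
M = 2_000
# stand-in posterior samples for Z (illustrative: normal posterior)
zs = [random.gauss(0.0, 1.0) for _ in range(M)]

def loss(a, z):
    """Any loss works here; squared loss for concreteness."""
    return (a - z) ** 2

def mc_objective(a):
    """Monte Carlo approximation of E[L(a, Z) | X]."""
    return sum(loss(a, z) for z in zs) / M

# restrict the action space to the samples themselves: each candidate action
# is automatically feasible (it is a posterior sample), so constraints are
# handled for free
best = min(zs, key=mc_objective)
```

Under squared loss the winner is the sample closest to the posterior mean, so this is a crude but constraint-respecting stand-in for the exact Bayes estimator.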
- Idea that could be part of a project: Bayes estimator for feature matrices
Bayes estimators from a frequentist perspective
Recall: admissibility, a frequentist notion of optimality (or rather, non-sub-optimality).
- An estimator \(\delta\) is admissible if there is no dominating estimator \(\delta'\)
- Domination here is under the frequentist risk \(R(z, \delta) = {\mathbf{E}}_z[L(\delta(X), z)]\),
- i.e. \(\delta\) is admissible if there is no \(\delta'\) such that \(R(z, \delta') \le R(z, \delta)\) for all \(z\), with strict inequality for at least one \(z\)
Proposition: if a Bayes estimator is unique, it is admissible.
To show uniqueness, may try to use convexity of loss function for example.
Discussion on last week example: change point problem / segmentation
- In a healthy cell, chromosomes (long strands of DNA) come in pairs (except for the sex chromosomes).
- In certain cancers, some parts of the chromosome come to lose one of the two copies.
- Suppose we segment one (non-sex) chromosome into segments \(0, 1, 2, \dots, 49\).
- For each segment, let \(S_i\) encode the number of copies in this segment.
- Examples of possible realizations for the copy numbers:

Change point problem
- Goal of inference: copy numbers, which are hard to observe directly.
- Instead, use sequencing data.
- DNA of many cells is pooled and cut into small pieces
- sequencing machines read some of these small pieces and map them to one of the 50 segments.
- Key point: if a segment has two copies, there should be on average twice as many reads mapped into it compared to a segment with only one copy.
- Let \(K\) be the average number of reads we get per segment and per chromosome copy.
- Could be determined experimentally or inferred; let us say we know \(K=3.5\).
Data looks like this:

Can you build a Bayesian model for that?
How to pick distributions?
- Use the type of the realization to narrow down the possibilities
- Example: for the Poisson emissions, we knew the support had to be \(\{0, 1, 2, 3, \dots\}\); this excludes Bernoulli and Uniform.
- Exploratory data analysis
- Ask experts/check literature if possible
- Theoretical grounds (asymptotic theory, mechanistic models, etc)
- Example: “law of rare events” provides some interpretation
- Try several methods and use Bayesian model selection techniques (next week, if time permits)
Frequently used distributions
By type:
- Binary data types
- Bernoulli (by definition)
- Enumeration data types
- Categorical (by definition)
- Integers \(\ge 0\)
- Geometric (mode at zero)
- Poisson (mode close to mean)
- For more detailed models, Negative Binomial (2 parameters)
- Continuous \(\ge 0\):
- Continuous in \([0, 1]\):
- Continuous Uniform
- For more detailed models, Beta
- Continuous on real line:
By interpretation:
- Sum of Bernoullis:
- Population survey:
- Waiting times:
- Geometric, Negative Binomial
Mathematical model
- Let \(S_0, S_1, \dots, S_{49}\) denote a list of random variables modelling the unobserved states.
- Each state takes one of two possible values (the “copy number state”).
- We want to encourage a priori the states to “stick” to the same value (i.e. the copy number does not change too often as we move along the genome), so we set \[P(S_i = s | S_{i-1} = s) = 0.95\]
- Exercise (part 1): how to write this in our mathematical notation, \(S_i | S_{i-1} \sim ???\) (hint: use Bernoullis and change the encoding of the states as 0 and 1 instead of 1 and 2)
- Exercise (part 2): write a JAGS model (hint: check “ifelse(x, a, b)” in the JAGS manual)
- Also have a look at the Blang model for comparison
- Let \(Y_0, Y_1, \dots, Y_{49}\) denote the observations (number of reads mapped for each of the 50 segments).
- We will use the following model (a common choice in bioinformatics): \[Y_i | S_i \sim \text{Poisson}(3.5 (S_i + 1))\]
- Recall: Poisson distribution
- Exercise (part 3): draw graphical model
- Recall \(K\) is the average number of reads we get per segment and per chromosome copy.
- How to proceed if \(K\) is not known?
- Part 4: Implement this idea in Blang as follows:
- In the variable declarations, make \(K\) random instead of param, and initialize it with ?: latentReal instead of fixedReal(3.5)
- Add a prior to \(K\) (see distributions reference)
- Graphical model of the new model?
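The generative model described above (a sticky binary chain plus Poisson emissions) can be forward-simulated in a few lines. This is a sketch for building intuition, not a substitute for the JAGS/Blang exercises; the inversion sampler for the Poisson is a standard textbook method, and all variable names here are illustrative:

```python
import math
import random

random.seed(2)
K = 3.5        # average reads per segment per chromosome copy (given as known)
n = 50         # number of segments
STICK = 0.95   # P(S_i = s | S_{i-1} = s): sticky transitions

# forward-simulate the sticky chain over copy-number states {0, 1}
# (encoding: state s means s + 1 chromosome copies)
s = [random.randint(0, 1)]
for _ in range(1, n):
    s.append(s[-1] if random.random() < STICK else 1 - s[-1])

def poisson_sample(rate):
    """Poisson sample via CDF inversion (no external libraries needed)."""
    u, k, p = random.random(), 0, math.exp(-rate)
    cdf = p
    while u > cdf:
        k += 1
        p *= rate / k
        cdf += p
    return k

# emissions: Y_i | S_i ~ Poisson(K * (S_i + 1))
y = [poisson_sample(K * (si + 1)) for si in s]
```

Plotting `y` against the segment index reproduces the shape of the data shown in the lecture: stretches around mean 3.5 (one copy) alternating with stretches around mean 7 (two copies).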