Alexandre Bouchard-Côté
3/27/2019
Today:
Recall: I promised Bayesian statistics provides a “meta-recipe” to attack new statistical problems.
Decision theory: the last ingredient that was missing for the meta-recipe
What we have so far: a joint distribution \(f(z, x)\), where our convention today is that \(z\) denotes the latent (unknown) quantity and \(x\) the observed data.
New today: a mathematical model for decision-making in the face of uncertainty
Preview of what is coming up. The above formalism will allow us to write the most important equation in Bayesian statistics: the Bayes estimator, \(\delta^*(X)\), defined as
\[\delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\}\]
Frequentist risk: view \(\theta = z\) as parameters for a likelihood / indexing probabilities over observables \(\{{\mathbb{P}}_z\}\), with a corresponding collection of expectation operators \(\{{\mathbf{E}}_z\}\), \[ \begin{aligned} R(z, \delta) &= {\mathbf{E}}_z[L(\delta(X), z)] \\ &= \int L(\delta(x), z)\ \text{likelihood}(x | z) {\text{d}}x \end{aligned} \]
Bayesian notion: integrated risk \[ \begin{aligned} r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \\ &= \int \int L(\delta(x), z)\ \text{prior}(z)\ \text{likelihood}(x | z)\ {\text{d}}x {\text{d}}z \end{aligned} \]
Key difference: the frequentist risk \(R(z, \delta)\) is a function of the unknown \(z\), whereas the integrated risk \(r(\delta)\) averages over \(z\) using the prior and hence assigns a single number to each estimator.
So far: abstract definition of Bayes estimators as minimizers of the integrated risk \[ \begin{aligned} \delta^* &= {\textrm{argmin}}_{\delta : {\mathscr{X}}\to {\mathcal{A}}} \{ r(\delta) \} \\ r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \end{aligned} \]
More explicit expression: The estimator \(\delta^*\), defined by the equation below, minimizes the integrated risk
\[ \delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \]
This estimator \(\delta^*\) is called a Bayes estimator.
This means that given a model and a goal, the Bayesian framework provides in principle a recipe for constructing an estimator.
However, the computation required to implement this recipe may be considerable. This explains why computational statistics plays a large role in Bayesian statistics and in this course.
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[Z \neq a] | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}(Z \neq a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ 1 - {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmax}}\{ {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\ &= \text{MAP estimator} \end{aligned} \]
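A minimal sketch of the last line, assuming posterior samples of a discrete \(Z\) are available (e.g., from MCMC): the MAP estimate under 0-1 loss is approximated by the most frequent sampled value. The sampling distribution used here is purely illustrative.

```python
import numpy as np

# Hypothetical posterior samples of a discrete Z (e.g. from MCMC output).
rng = np.random.default_rng(1)
samples = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.5, 0.3])

# Under 0-1 loss, the Bayes estimator is the posterior mode (MAP):
# approximate P(Z = a | X) by sample frequencies and take the argmax.
values, counts = np.unique(samples, return_counts=True)
map_estimate = values[np.argmax(counts)]
```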
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[(Z - a)^2 | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[Z^2 | X] - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \end{aligned} \] Now think of \({\mathbf{E}}[Z|X]\) as a constant obtained from the posterior. To minimize the last expression, take the derivative with respect to \(a\) and equate it to zero:
\[ \begin{aligned} -2 {\mathbf{E}}[Z|X] + 2a = 0 \end{aligned} \] Hence: here the Bayes estimator is the posterior mean, \(\delta^*(X) = {\mathbf{E}}[Z|X]\).
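This can be checked numerically: a minimal sketch, assuming hypothetical normal posterior samples, evaluating the Monte Carlo estimate of \({\mathbf{E}}[(Z-a)^2|X]\) over a grid of actions and confirming the minimizer sits at the sample mean.

```python
import numpy as np

# Hypothetical posterior samples of Z; under squared loss the Bayes
# estimator is the posterior mean E[Z | X].
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=5000)

grid = np.linspace(0.0, 4.0, 401)
risk = [np.mean((samples - a) ** 2) for a in grid]  # Monte Carlo E[(Z-a)^2 | X]
best = grid[np.argmin(risk)]                        # close to samples.mean()
```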
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z \notin [c, d] | X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z < c|X] + {\mathbb{P}}[Z > d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z \le c|X] - {\mathbb{P}}[Z \le d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \end{aligned} \] Here we assume the posterior has a continuous density \(f\), which lets us change \(<\) into \(\le\) and drop the additive constant 1 arising from \({\mathbb{P}}[Z > d|X] = 1 - {\mathbb{P}}[Z \le d|X]\). We now take the derivative with respect to \(c\) and set it to zero; the same is then done for \(d\). Notice that \({\mathbb{P}}[Z \le c|X]\) is the posterior CDF, so taking the derivative with respect to \(c\) yields a density:
\[ f_{Z|X}(c) - k = 0, \]
so we see the optimum will be the largest \([c, d]\) such that \(f(c) = f(d) = k\). Tuning \(k\) gives us different HDI levels.
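In practice the HDI is usually computed directly from posterior samples rather than by solving \(f(c) = f(d) = k\). A minimal sketch, assuming a unimodal posterior: the HDI at level `mass` is approximated by the shortest interval covering that fraction of the sorted samples.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples.

    For a unimodal posterior this approximates the HDI [c, d], at whose
    endpoints the density is cut at a common level k, as derived above.
    """
    s = np.sort(samples)
    n = len(s)
    m = int(np.ceil(mass * n))
    widths = s[m - 1:] - s[:n - m + 1]  # all intervals covering m samples
    i = np.argmin(widths)               # index of the shortest one
    return s[i], s[i + m - 1]

# Sanity check on a standard normal: the 95% HDI is about (-1.96, 1.96).
rng = np.random.default_rng(2)
c, d = hdi(rng.normal(size=100_000), mass=0.95)
```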
Recall: the rand loss between two partitions \(\rho\) and \(\rho'\) counts the pairs of indices on which they disagree,
\[ \sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)], \]
Goal: computing the Bayes estimator derived from the rand loss.
First, we can write:
\[ \begin{aligned} {\textrm{argmin}}_{\textrm{partition }\rho'} {\mathbf{E}}\left[{\textrm{rand}}(\rho, \rho')|X\right] &= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} {\mathbf{E}}\left[{{\bf 1}}\left[\rho_{ij} \neq \rho'_{ij}\right]|X\right] \\ &= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} \left\{(1-\rho'_{ij}){\mathbb{P}}(\rho_{ij} = 1|X) + \rho'_{ij} \left(1- {\mathbb{P}}(\rho_{ij} = 1 |X)\right)\right\} \end{aligned} \]
where \(\rho_{ij} = (i \sim_{\rho} j)\), which can be viewed as edge indicators on a graph.
The above identity comes from the fact that \(\rho_{ij}\) is either one or zero, so \({{\bf 1}}[\rho_{ij} \neq \rho'_{ij}] = (1-\rho'_{ij})\, {{\bf 1}}[\rho_{ij} = 1] + \rho'_{ij}\, {{\bf 1}}[\rho_{ij} = 0]\); taking conditional expectations of both sides gives the probabilities above.
This means that computing an optimal bipartition of the data into two clusters can be done in two steps: first, compute (e.g. by MCMC) the posterior co-clustering probabilities \({\mathbb{P}}(\rho_{ij} = 1 | X)\); second, minimize the above objective over bipartitions \(\rho'\).
Note that the second step can be efficiently computed using max-flow/min-cut algorithms (understanding how these algorithms work is outside the scope of this lecture, but if you are curious, see CLRS, chapter 26).
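The first step above can be sketched as follows, assuming hypothetical MCMC output where each sample assigns a cluster label to each data point: the co-clustering probability for a pair is estimated by the fraction of samples in which the pair shares a label.

```python
import numpy as np

# Hypothetical MCMC output: each row is one posterior sample of the
# partition rho, encoded as cluster labels for n = 4 data points.
partition_samples = np.array([
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
])

# Posterior co-clustering probabilities P(rho_ij = 1 | X), estimated by
# the fraction of samples in which points i and j share a cluster.
same = partition_samples[:, :, None] == partition_samples[:, None, :]
coclustering = same.mean(axis=0)  # n x n matrix of edge probabilities
```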
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in {\mathcal{A}}\} \\ \end{aligned} \] where \(Z_1, \dots, Z_M\) are (approximate) posterior samples given \(X\), e.g., from MCMC.
Idea that could be part of a project: stochastic gradient meets Bayes estimators
If \({\mathcal{A}}\) is tricky to explore (combinatorial, constrained as in the motivating tracking problem, etc.) and \({\mathcal{A}}= {\mathscr{Z}}\), we can further approximate both the objective and the constraint set as follows:
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in \{Z_1, Z_2, \dots, Z_M\} \} \\ \end{aligned} \]
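A minimal sketch of this doubly-approximated estimator, with hypothetical normal posterior samples and squared loss standing in for a harder combinatorial loss: each sample is scored by its average loss against all samples, and the best-scoring sample is returned.

```python
import numpy as np

# Monte Carlo approximation of the Bayes estimator when the action space
# is restricted to the posterior samples themselves.
rng = np.random.default_rng(3)
Z = rng.normal(loc=1.0, size=500)   # hypothetical posterior samples

def loss(a, z):
    return (a - z) ** 2             # squared loss, as an illustration

# Average loss of each candidate action (a sample) against all samples,
# then keep the minimizer; under squared loss this picks the sample
# closest to the posterior mean.
avg_loss = np.array([np.mean(loss(a, Z)) for a in Z])
delta_star = Z[np.argmin(avg_loss)]
```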
Recall: admissibility, a frequentist notion of optimality (or rather, non-sub-optimality).
Proposition: if a Bayes estimator is unique, it is admissible.
To show uniqueness, one may for example use strict convexity of the loss function.
Data looks like this:
Can you build a Bayesian model for that?
By type:
By interpretation:
Use random instead of param, and initialize it with ?: latentReal instead of fixedReal(3.5).
\[ m_i(x) = \int p_i(z)\, \ell_i(x | z)\, {\text{d}}z \]
Notation warning: \(m\) is called \(Z\) in a Monte Carlo context, but here \(Z\) will be used for the random variable corresponding to the latent variable \(z\).
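As a minimal sketch of what \(m_i(x)\) is, one can estimate it by naive Monte Carlo: average the likelihood over prior draws. This estimator is simple but often high-variance (more reliable methods are discussed below); the conjugate normal model used here is purely illustrative, chosen so the exact answer is available in closed form.

```python
import numpy as np

# Naive Monte Carlo estimate of the marginal likelihood
#   m(x) = int prior(z) likelihood(x | z) dz  ~  mean of likelihood(x | z_m)
# over prior draws z_m, for the model z ~ N(0, 1), x | z ~ N(z, 1).
x = 0.5
rng = np.random.default_rng(4)
z = rng.normal(0.0, 1.0, size=200_000)                       # prior draws
m_hat = np.mean(np.exp(-(x - z) ** 2 / 2) / np.sqrt(2 * np.pi))

# Exact answer: marginally x ~ N(0, 2), so m(x) = N(x; 0, 2).
m_exact = np.exp(-x**2 / 4) / np.sqrt(4 * np.pi)
```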
Put a prior \(p\) on \(I\), and make the uncertainty over models part of the probabilistic model.
The new joint probability density is given by:
\[ p((i, z), x) = p(i) p_i(z) \ell_i(x | z), \]
where \((i, z)\) is a member of a new latent space given by:
\[ {\mathscr{Z}}= \bigcup_{i\in I} \left( \{i\} \times {\mathscr{Z}}_i \right), \]
Notation: denote the event that model \(i\) is the model explaining the data by \(M_i\).
Outcome: Using this construction, model choice can in principle be approached using the same methods as those used last week.
Graphical modelling: \({\mathscr{Z}}\) cannot be directly expressed as a non-trivial graphical model (since it is not a product space). How to transform it into a graphical model?
Non-regularity: even with the reductions introduced so far, model selection deserves special attention because of non-regularities: the likelihood depends in a non-smooth way on the model indicator variable. Importantly, different models have different latent space dimensionalities. We will see that MCMC then requires special techniques called trans-dimensional MCMC.
Ratio of the marginal likelihood for two models:
\[ B_{12} = \frac{m_1(x)}{m_2(x)} \]
Values of \(B_{12}\) greater than 1.0 favor model #1 over #2. Values smaller than 1.0 favor #2 over #1.
This is just a reparameterization of the Bayes estimator with an asymmetric 0-1 loss. Note that it is different from a likelihood ratio:
\[ \frac{\sup_{z_1} \ell_1(x|z_1)}{\sup_{z_2} \ell_2(x|z_2)}, \]
which is not used within the Bayesian framework.
Many approaches exist, with pros and cons; see for example “19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology”.
I will outline some key methods:
The idea is to build an augmented model, which can be written as a graphical model, and from which we can still approximate \(m_i(x)\).
Construction of the auxiliary latent space:
This creates the following auxiliary latent space:
\[ {\mathscr{Z}}' = \{1, 2\} \times {\mathscr{Z}}_1 \times {\mathscr{Z}}_2. \]
Example: dim(\({\mathscr{Z}}_1\)) = 1, dim(\({\mathscr{Z}}_2\)) = 2.
Construction of the auxiliary joint distribution: suppose the current state is \((\mu, z_1, z_2, x)\). We need to define an auxiliary joint density \(\tilde p(\mu, z_1, z_2, x)\).
The idea is that when \(\mu = 1\), we explain the data \(x\) using \(z_1\), and when \(\mu = 2\), we explain the data \(x\) using \(z_2\).
In notation, if \(\mu = 1\), we set:
\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_1(x | z_1), \]
and if \(\mu = 2\),
\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_2(x | z_2). \]
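A minimal sketch of the two-case density above, assuming (for illustration only) standard-normal priors/pseudo-priors, normal likelihoods, and a scalar \(z_2\) (even though the example above has \(\dim({\mathscr{Z}}_2) = 2\)): the indicator \(\mu\) selects which likelihood term explains the data, while both \(z_1\) and \(z_2\) always contribute their prior/pseudo-prior terms.

```python
import numpy as np

def log_normal_pdf(v, mean=0.0, sd=1.0):
    return -0.5 * np.log(2 * np.pi * sd**2) - (v - mean) ** 2 / (2 * sd**2)

def log_tilde_p(mu, z1, z2, x, p_mu=(0.5, 0.5)):
    """Log of the auxiliary joint density tilde_p(mu, z1, z2, x)."""
    out = np.log(p_mu[mu - 1])   # p(mu): prior on the model indicator
    out += log_normal_pdf(z1)    # p_1(z_1): prior or pseudo-prior term
    out += log_normal_pdf(z2)    # p_2(z_2): prior or pseudo-prior term
    # Only the active model's likelihood term explains the data x:
    out += log_normal_pdf(x, mean=z1 if mu == 1 else z2)
    return out
```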
Pro: can use existing MCMC tools that do not provide built-in marginal likelihood computation, such as JAGS/Stan.
Cons: