STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

3/27/2019

Goals

Today:

Logistics

Recall: motivation for decision theory

Recall: I promised Bayesian statistics provides a “meta-recipe” to attack new statistical problems.

Decision theory: the last ingredient that was missing for the meta-recipe

  1. Construct a probability model including
    • random variables for what we will measure/observe
    • random variables for the unknown quantities
      • those we are interested in (“parameters”, “predictions”)
      • others that just help us formulate the problem (“nuisance”, “random effects”).
  2. Compute the posterior distribution conditionally on the actual data at hand
  3. Use the posterior distribution to:
    • make prediction (point estimate)
    • estimate uncertainty (credible intervals)
    • make a decision (e.g. pick website A versus B, fly or ground the rocket, etc)

Recall: setup

What we have so far: joint distribution \(f(z, y)\) where our convention today will be

New today: a mathematical model for decision-making in the face of uncertainty

Preview of what is coming up. The above formalism will allow us to write the most important equation in Bayesian statistic: the Bayes estimator, \(\delta^*(X)\), defined as

\[\delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\}\]

Recall: estimators

Recall: evaluation of estimators

Key difference:

The Bayes estimator

So far: abstract definition of Bayes estimators as minimizers of the integrated risk \[ \begin{aligned} \delta^* &= {\textrm{argmin}}_{\delta : {\mathscr{X}}\to {\mathcal{A}}} \{ r(\delta) \} \\ r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \end{aligned} \]

More explicit expression: The estimator \(\delta^*\), defined by the equation below, minimizes the integrated risk

\[ \delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \]

This estimator \(\delta^*\) is called a Bayes estimator.

This means that given a model and a goal, the Bayesian framework provides in principle a recipe for constructing an estimator.

However, the computation required to implement this recipe may be considerable. This explains why computational statistics plays a large role in Bayesian statistics and in this course.

Examples: from a loss to a Bayes estimator

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[Z \neq a] | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}(Z \neq a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ 1 - {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmax}}\{ {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\ &= \text{MAP estimator} \end{aligned} \]

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[(Z - a)^2 | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[Z^2 | X] - 2a{\mathbf{E}}[Z | X]] + a^2 : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ - 2a{\mathbf{E}}[Z | X]] + a^2 : a \in {\mathcal{A}}\} \end{aligned} \] Now think of \({\mathbf{E}}[Z|X]\) as a constant that you get from the posterior. To minimize the bottom expression, take derivative with respect to \(a\), equate to zero:

\[ \begin{aligned} -2 {\mathbf{E}}[Z|X] + 2a = 0 \end{aligned} \] Hence: here the Bayes estimator is the posterior mean, \(\delta^*(X) = {\mathbf{E}}[Z|X]\).

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z \notin [c, d] | X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z < c|X] + {\mathbb{P}}[Z > d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z \le c|X] - {\mathbb{P}}[Z \le d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \end{aligned} \] Assuming the posterior has a continuous density \(f\) to change \(<\) into \(\le\). Again we take the derivative with respect to \(c\) and set to zero; then will do the same thing for \(d\). Notice that \({\mathbb{P}}[Z \le c|X]\) is the posterior CDF, so taking the derivative with respect to \(c\) yields a density:

\[ f_{Z|X}(c) - k = 0, \]

so we see the optimum will be the largest \([c, d]\) such that \(f(c) = f(d) = k\). Tuning \(k\) gives us different HDI levels.

Example: clustering, continued

Recall:

\[ \sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)], \]

Goal: computing the Bayes estimator derived from the rand loss.

First, we can write:

\[ \begin{aligned} {\textrm{argmin}}_{\textrm{partition }\rho'} {\mathbf{E}}\left[{\textrm{rand}}(\rho, \rho')|X\right] &= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} {\mathbf{E}}\left[{{\bf 1}}\left[\rho_{ij} \neq \rho'_{ij}\right]|X\right] \\ &= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} \left\{(1-\rho'_{ij}){\mathbb{P}}(\rho_{ij} = 1|X) + \rho'_{ij} \left(1- {\mathbb{P}}(\rho_{ij} = 1 |X)\right)\right\} \end{aligned} \]

where \(\rho_{i,j} = (i \sim_{\rho} j)\), which can be viewed as edge indicators on a graph.

The above identity comes from the fact that \(\rho_{i,j}\) is either one or zero, so:

This means that computing an optimal bipartition of the data into two clusters can be done in two steps:

  1. Simulating a Markov chain, and use the samples to estimate \({s}_{i,j} = {\mathbb{P}}(\rho_{ij} = 1 | Y)\) via Monte Carlo averages.
  2. Minimize the linear objective function \(\sum_{i<j} \left\{(1-\rho_{ij}){s}_{i,j} + \rho_{ij} \left(1- {s}_{i,j}\right)\right\}\) over bipartitions \(\rho\).

Note that the second step can be efficiently computed using min-flow/max-cut algorithms (understanding how this algorithm works is outside of the scope of this lecture, but if you are curious, see CLRS, chapter 26).

Black box optimization of Bayes estimators

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in {\mathcal{A}}\} \\ \end{aligned} \]

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in \{Z_1, Z_2, \dots, Z_M\} \} \\ \end{aligned} \]

Bayes estimators from a frequentist perspective

Recall: admissibility, a frequentist notion of optimality (or rather, non-sub-optimality).

Proposition: if a Bayes estimator is unique, it is admissible.

To show uniqueness, may try to use convexity of loss function for example.

Discussion on last week example: change point problem / segmentation

drawing

Recall: change point problem

Data looks like this:

drawing

Can you build a Bayesian model for that?

How to pick distributions?

Frequently used distributions

By type:

By interpretation:

Mathematical model

Model selection: examples

Model selection: Notation

\[ m_i = \int p_i(z) \ell_i(z) {\text{d}}z \]

Notation warning: \(m\) is called \(Z\) in a Monte Carlo context, but here \(Z\) will be used for the random variable corresponding to the latent variable \(z\).

Bayesian model selection: key idea

Put a prior \(p\) on \(I\), and make the uncertainty over models part of the probabilistic model.

The new joint probability density is given by:

\[ p((i, z), x) = p(i) p_i(z) \ell_i(x | z), \]

where \((i, z)\) is a member of a new latent space given by:

\[ {\mathscr{Z}}= \bigcup_{i\in I} \left( \{i\} \times {\mathscr{Z}}_i \right), \]

Notation: denote the event that model \(i\) is the model explaining the data by \(M_i\).

Outcome: Using this construction, model choice can in principle be approached using the same methods as those used last week.

Observations

Graphical modelling: \({\mathscr{Z}}\) cannot be directly expressed as a non-trivial graphical model (since it is not a product space). How to transform it into a graphical model?

Non-regularity: even with the reductions introduced so far, model selection justifies special attentions because of non-regularities: the likelihood depends in a non-smooth way upon the model indicator variable. Importantly, different models have different dimensionality of the latent space. We will see that MCMC then requires special techniques called trans-dimensional MCMC.

Bayes factor

Ratio of the marginal likelihood for two models:

\[ B_{12} = \frac{m_1(x)}{m_2(x)} \]

Values of \(B_{12}\) greater than 1.0 favor model #1 over #2. Values smaller than 1.0 favor #2 over #1.

This is just a reparameterization of the Bayes estimator with an asymmetric 0-1 loss. Note that it is different from a likelihood ratio:

\[ \frac{\sup_{z_1} \ell_1(x|z_1)}{\sup_{z_2} \ell_2(x|z_2)}, \]

which is not used within the Bayesian framework.

Computation of Bayes factor

Many approaches exist, with pros and cons, see for example 19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology

I will outline some key methods:

Model saturation

The idea is to build an augmented model, which can be written as a graphical model, and from which we can still approximate \(m_i(x)\).

Construction of the auxiliary latent space:

This creates the following auxiliary latent space:

\[ {\mathscr{Z}}' = \{1, 2\} \times {\mathscr{Z}}_1 \times {\mathscr{Z}}_2. \]

Example: dim(\({\mathscr{Z}}_1\)) = 1, dim(\({\mathscr{Z}}_2\)) = 2.

Construction of the auxiliary joint distribution: suppose the current state is \((\mu, z_1, z_2, x)\). We need to define an auxiliary joint density \(\tilde p(\mu, z_1, z_2, y)\).

The idea is that when \(\mu = 1\), we explain the data \(x\) using \(z_1\), and when \(\mu = 2\), we explain the data \(x\) using \(z_2\).

In notation, if \(\mu = 1\), we set:

\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_1(x | z_1), \]

and if \(\mu = 2\),

\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_2(x | z_2). \]

Pro: can use existing MCMC methods that do not have marginal likelihood computation methods, such as JAGS/Stan .

Cons: