Alexandre Bouchard-Côté
3/27/2019
Today:
Recall: I promised Bayesian statistics provides a “meta-recipe” to attack new statistical problems.
Decision theory: the last ingredient that was missing for the meta-recipe
What we have so far: a joint distribution \(f(z, x)\), where our convention today is that \(z\) denotes the latent (unknown) quantity and \(x\) the observed data.
New today: a mathematical model for decision-making in the face of uncertainty
Preview of what is coming up. The above formalism will allow us to write the most important equation in Bayesian statistics: the Bayes estimator, \(\delta^*(X)\), defined as
\[\delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\}\]
Frequentist risk: view \(\theta = z\) as parameters for a likelihood / indexing probabilities over observables \(\{{\mathbb{P}}_z\}\), with a corresponding collection of expectation operators \(\{{\mathbf{E}}_z\}\), \[ \begin{aligned} R(z, \delta) &= {\mathbf{E}}_z[L(\delta(X), z)] \\ &= \int L(\delta(x), z)\ \text{likelihood}(x | z) {\text{d}}x \end{aligned} \]
Bayesian notion: integrated risk \[ \begin{aligned} r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \\ &= \int \int L(\delta(x), z)\ \text{prior}(z)\ \text{likelihood}(x | z)\ {\text{d}}x {\text{d}}z \end{aligned} \]
Key difference: the frequentist risk \(R(z, \delta)\) is a function of the unknown \(z\), whereas the integrated risk \(r(\delta)\) averages over \(z\) using the prior and hence assigns a single number to each estimator.
So far: abstract definition of Bayes estimators as minimizers of the integrated risk \[ \begin{aligned} \delta^* &= {\textrm{argmin}}_{\delta : {\mathscr{X}}\to {\mathcal{A}}} \{ r(\delta) \} \\ r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \end{aligned} \]
More explicit expression: The estimator \(\delta^*\), defined by the equation below, minimizes the integrated risk
\[ \delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \]
This estimator \(\delta^*\) is called a Bayes estimator.
This means that given a model and a goal, the Bayesian framework provides in principle a recipe for constructing an estimator.
However, the computation required to implement this recipe may be considerable. This explains why computational statistics plays a large role in Bayesian statistics and in this course.
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[{{\bf 1}}[Z \neq a] | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}(Z \neq a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ 1 - {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\ &= {\textrm{argmax}}\{ {\mathbb{P}}(Z = a | X) : a \in {\mathcal{A}}\} \\ &= \text{MAP estimator} \end{aligned} \]
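A minimal sketch of the last line, assuming posterior samples of a discrete \(Z\) are available (e.g., from MCMC): the MAP estimate under 0-1 loss is approximated by the most frequent sampled value. The sampling distribution used here is purely illustrative.

```python
import numpy as np

# Hypothetical posterior samples of a discrete Z (e.g. from MCMC output).
rng = np.random.default_rng(1)
samples = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.5, 0.3])

# Under 0-1 loss, the Bayes estimator is the posterior mode (MAP):
# approximate P(Z = a | X) by sample frequencies and take the argmax.
values, counts = np.unique(samples, return_counts=True)
map_estimate = values[np.argmax(counts)]
```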
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[(Z - a)^2 | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[Z^2 | X] - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \end{aligned} \] Now think of \({\mathbf{E}}[Z|X]\) as a constant obtained from the posterior. To minimize the last expression, take the derivative with respect to \(a\) and equate it to zero:
\[ \begin{aligned} -2 {\mathbf{E}}[Z|X] + 2a = 0 \end{aligned} \] Hence: here the Bayes estimator is the posterior mean, \(\delta^*(X) = {\mathbf{E}}[Z|X]\).
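This can be checked numerically: a minimal sketch, assuming hypothetical normal posterior samples, evaluating the Monte Carlo estimate of \({\mathbf{E}}[(Z-a)^2|X]\) over a grid of actions and confirming the minimizer sits at the sample mean.

```python
import numpy as np

# Hypothetical posterior samples of Z; under squared loss the Bayes
# estimator is the posterior mean E[Z | X].
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=5000)

grid = np.linspace(0.0, 4.0, 401)
risk = [np.mean((samples - a) ** 2) for a in grid]  # Monte Carlo E[(Z-a)^2 | X]
best = grid[np.argmin(risk)]                        # close to samples.mean()
```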
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z \notin [c, d] | X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z < c|X] + {\mathbb{P}}[Z > d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z \le c|X] - {\mathbb{P}}[Z \le d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \end{aligned} \] Here we assume the posterior has a continuous density \(f\), which lets us change \(<\) into \(\le\) and drop the additive constant 1 arising from \({\mathbb{P}}[Z > d|X] = 1 - {\mathbb{P}}[Z \le d|X]\). We now take the derivative with respect to \(c\) and set it to zero; the same is then done for \(d\). Notice that \({\mathbb{P}}[Z \le c|X]\) is the posterior CDF, so taking the derivative with respect to \(c\) yields a density:
\[ f_{Z|X}(c) - k = 0, \]
so we see the optimum will be the largest \([c, d]\) such that \(f(c) = f(d) = k\). Tuning \(k\) gives us different HDI levels.
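In practice the HDI is usually computed directly from posterior samples rather than by solving \(f(c) = f(d) = k\). A minimal sketch, assuming a unimodal posterior: the HDI at level `mass` is approximated by the shortest interval covering that fraction of the sorted samples.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples.

    For a unimodal posterior this approximates the HDI [c, d], at whose
    endpoints the density is cut at a common level k, as derived above.
    """
    s = np.sort(samples)
    n = len(s)
    m = int(np.ceil(mass * n))
    widths = s[m - 1:] - s[:n - m + 1]  # all intervals covering m samples
    i = np.argmin(widths)               # index of the shortest one
    return s[i], s[i + m - 1]

# Sanity check on a standard normal: the 95% HDI is about (-1.96, 1.96).
rng = np.random.default_rng(2)
c, d = hdi(rng.normal(size=100_000), mass=0.95)
```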
Recall: the rand loss between two partitions \(\rho\) and \(\rho'\) counts the pairs of indices on which they disagree,
\[ \sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)], \]
Goal: computing the Bayes estimator derived from the rand loss.
First, we can write:
\[ \begin{aligned} {\textrm{argmin}}_{\textrm{partition }\rho'} {\mathbf{E}}\left[{\textrm{rand}}(\rho, \rho')|X\right] &= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} {\mathbf{E}}\left[{{\bf 1}}\left[\rho_{ij} \neq \rho'_{ij}\right]|X\right] \\ &= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} \left\{(1-\rho'_{ij}){\mathbb{P}}(\rho_{ij} = 1|X) + \rho'_{ij} \left(1- {\mathbb{P}}(\rho_{ij} = 1 |X)\right)\right\} \end{aligned} \]
where \(\rho_{ij} = (i \sim_{\rho} j)\), which can be viewed as edge indicators on a graph.
The above identity comes from the fact that \(\rho_{ij}\) is either one or zero, so \({{\bf 1}}[\rho_{ij} \neq \rho'_{ij}] = (1-\rho'_{ij})\, {{\bf 1}}[\rho_{ij} = 1] + \rho'_{ij}\, {{\bf 1}}[\rho_{ij} = 0]\); taking conditional expectations of both sides gives the probabilities above.
This means that computing an optimal bipartition of the data into two clusters can be done in two steps: first, compute (e.g. by MCMC) the posterior co-clustering probabilities \({\mathbb{P}}(\rho_{ij} = 1 | X)\); second, minimize the above objective over bipartitions \(\rho'\).
Note that the second step can be efficiently computed using max-flow/min-cut algorithms (understanding how these algorithms work is outside the scope of this lecture, but if you are curious, see CLRS, chapter 26).
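The first step above can be sketched as follows, assuming hypothetical MCMC output where each sample assigns a cluster label to each data point: the co-clustering probability for a pair is estimated by the fraction of samples in which the pair shares a label.

```python
import numpy as np

# Hypothetical MCMC output: each row is one posterior sample of the
# partition rho, encoded as cluster labels for n = 4 data points.
partition_samples = np.array([
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
])

# Posterior co-clustering probabilities P(rho_ij = 1 | X), estimated by
# the fraction of samples in which points i and j share a cluster.
same = partition_samples[:, :, None] == partition_samples[:, None, :]
coclustering = same.mean(axis=0)  # n x n matrix of edge probabilities
```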
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in {\mathcal{A}}\} \\ \end{aligned} \] where \(Z_1, \dots, Z_M\) are (approximate) posterior samples given \(X\), e.g., from MCMC.
Idea that could be part of a project: stochastic gradient meets Bayes estimators
If \({\mathcal{A}}\) is tricky to explore (combinatorial, constrained as in the motivating tracking problem, etc.) and \({\mathcal{A}}= {\mathscr{Z}}\), we can further approximate both the objective and the constraint set as follows:
\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in \{Z_1, Z_2, \dots, Z_M\} \} \\ \end{aligned} \]
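A minimal sketch of this doubly-approximated estimator, with hypothetical normal posterior samples and squared loss standing in for a harder combinatorial loss: each sample is scored by its average loss against all samples, and the best-scoring sample is returned.

```python
import numpy as np

# Monte Carlo approximation of the Bayes estimator when the action space
# is restricted to the posterior samples themselves.
rng = np.random.default_rng(3)
Z = rng.normal(loc=1.0, size=500)   # hypothetical posterior samples

def loss(a, z):
    return (a - z) ** 2             # squared loss, as an illustration

# Average loss of each candidate action (a sample) against all samples,
# then keep the minimizer; under squared loss this picks the sample
# closest to the posterior mean.
avg_loss = np.array([np.mean(loss(a, Z)) for a in Z])
delta_star = Z[np.argmin(avg_loss)]
```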
Recall: admissibility, a frequentist notion of optimality (or rather, non-sub-optimality).
Proposition: if a Bayes estimator is unique, it is admissible.
To show uniqueness, one may for example use strict convexity of the loss function.
Data looks like this:
Can you build a Bayesian model for that?
By type:
By interpretation:
Use random instead of param, and initialize it with ?: latentReal instead of fixedReal(3.5).
\[ m_i(x) = \int p_i(z)\, \ell_i(x | z)\, {\text{d}}z \]
Notation warning: \(m\) is called \(Z\) in a Monte Carlo context, but here \(Z\) will be used for the random variable corresponding to the latent variable \(z\).
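As a minimal sketch of what \(m_i(x)\) is, one can estimate it by naive Monte Carlo: average the likelihood over prior draws. This estimator is simple but often high-variance (more reliable methods are discussed below); the conjugate normal model used here is purely illustrative, chosen so the exact answer is available in closed form.

```python
import numpy as np

# Naive Monte Carlo estimate of the marginal likelihood
#   m(x) = int prior(z) likelihood(x | z) dz  ~  mean of likelihood(x | z_m)
# over prior draws z_m, for the model z ~ N(0, 1), x | z ~ N(z, 1).
x = 0.5
rng = np.random.default_rng(4)
z = rng.normal(0.0, 1.0, size=200_000)                       # prior draws
m_hat = np.mean(np.exp(-(x - z) ** 2 / 2) / np.sqrt(2 * np.pi))

# Exact answer: marginally x ~ N(0, 2), so m(x) = N(x; 0, 2).
m_exact = np.exp(-x**2 / 4) / np.sqrt(4 * np.pi)
```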
Put a prior \(p\) on \(I\), and make the uncertainty over models part of the probabilistic model.
The new joint probability density is given by:
\[ p((i, z), x) = p(i) p_i(z) \ell_i(x | z), \]
where \((i, z)\) is a member of a new latent space given by:
\[ {\mathscr{Z}}= \bigcup_{i\in I} \left( \{i\} \times {\mathscr{Z}}_i \right), \]
Notation: denote the event that model \(i\) is the model explaining the data by \(M_i\).
Outcome: Using this construction, model choice can in principle be approached using the same methods as those used last week.
Graphical modelling: \({\mathscr{Z}}\) cannot be directly expressed as a non-trivial graphical model (since it is not a product space). How to transform it into a graphical model?
Non-regularity: even with the reductions introduced so far, model selection deserves special attention because of non-regularities: the likelihood depends in a non-smooth way on the model indicator variable. Importantly, different models have different latent space dimensionalities. We will see that MCMC then requires special techniques called trans-dimensional MCMC.
Ratio of the marginal likelihood for two models:
\[ B_{12} = \frac{m_1(x)}{m_2(x)} \]
Values of \(B_{12}\) greater than 1.0 favor model #1 over #2. Values smaller than 1.0 favor #2 over #1.
This is just a reparameterization of the Bayes estimator with an asymmetric 0-1 loss. Note that it is different from a likelihood ratio:
\[ \frac{\sup_{z_1} \ell_1(x|z_1)}{\sup_{z_2} \ell_2(x|z_2)}, \]
which is not used within the Bayesian framework.
Many approaches exist, with pros and cons; see for example “19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology”.
I will outline some key methods:
The idea is to build an augmented model, which can be written as a graphical model, and from which we can still approximate \(m_i(x)\).
Construction of the auxiliary latent space:
This creates the following auxiliary latent space:
\[ {\mathscr{Z}}' = \{1, 2\} \times {\mathscr{Z}}_1 \times {\mathscr{Z}}_2. \]
Example: dim(\({\mathscr{Z}}_1\)) = 1, dim(\({\mathscr{Z}}_2\)) = 2.
Construction of the auxiliary joint distribution: suppose the current state is \((\mu, z_1, z_2, x)\). We need to define an auxiliary joint density \(\tilde p(\mu, z_1, z_2, x)\).
The idea is that when \(\mu = 1\), we explain the data \(x\) using \(z_1\), and when \(\mu = 2\), we explain the data \(x\) using \(z_2\).
In notation, if \(\mu = 1\), we set:
\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_1(x | z_1), \]
and if \(\mu = 2\),
\[ \tilde p(\mu, z_1, z_2, x) = p(\mu) p_1(z_1) p_2(z_2) \ell_2(x | z_2). \]
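A minimal sketch of the two-case density above, assuming (for illustration only) standard-normal priors/pseudo-priors, normal likelihoods, and a scalar \(z_2\) (even though the example above has \(\dim({\mathscr{Z}}_2) = 2\)): the indicator \(\mu\) selects which likelihood term explains the data, while both \(z_1\) and \(z_2\) always contribute their prior/pseudo-prior terms.

```python
import numpy as np

def log_normal_pdf(v, mean=0.0, sd=1.0):
    return -0.5 * np.log(2 * np.pi * sd**2) - (v - mean) ** 2 / (2 * sd**2)

def log_tilde_p(mu, z1, z2, x, p_mu=(0.5, 0.5)):
    """Log of the auxiliary joint density tilde_p(mu, z1, z2, x)."""
    out = np.log(p_mu[mu - 1])   # p(mu): prior on the model indicator
    out += log_normal_pdf(z1)    # p_1(z_1): prior or pseudo-prior term
    out += log_normal_pdf(z2)    # p_2(z_2): prior or pseudo-prior term
    # Only the active model's likelihood term explains the data x:
    out += log_normal_pdf(x, mean=z1 if mu == 1 else z2)
    return out
```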
Pro: can use existing MCMC tools that do not provide built-in marginal likelihood computation, such as JAGS/Stan.
Cons: