Mixtures

Alexandre Bouchard-Côté

Mixture models: motivation


Setup: molecular barcodes in cancer cells

Note: when a cell divides, its barcode gets copied, so at the end of the process each “unique” barcode is no longer unique! The number of times we sequence each barcode therefore gives us an indication of how many progeny the original CRISPR-edited cell gave rise to, hence a notion of “family size”.


Data: the number of times we sequenced each barcode

Challenges

Poll: How to check whether, say, a Poisson is a good choice?

  1. Posterior predictive checks
  2. Compute Bayes factors
  3. Leave one out probability integral transform
  4. Simulation based calibration
  5. None of the above

Goodness-of-fit checks!
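One standard goodness-of-fit tool here is the posterior predictive check: simulate replicated datasets from the posterior predictive distribution and compare them to the observed data through a test statistic. Below is a minimal sketch assuming a conjugate Gamma prior on the Poisson rate; the placeholder counts `y`, the hyperparameters `a, b`, and the choice of statistic are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder barcode counts; the real data are the per-barcode
# sequencing counts from the CRISPR experiment.
y = np.array([1, 1, 2, 2, 3, 5, 8, 13, 40, 120])
n = len(y)

# Conjugate Gamma(a, b) prior on the Poisson rate: the posterior is
# Gamma(a + sum(y), b + n), so we can sample it exactly.
a, b = 1.0, 1.0
rates = rng.gamma(a + y.sum(), 1.0 / (b + n), size=1000)

# Posterior predictive replicates: one simulated dataset per posterior draw.
y_rep = rng.poisson(rates[:, None], size=(1000, n))

# Compare a tail-sensitive statistic between observed and replicated data;
# an extreme tail probability signals model misfit (e.g. overdispersion).
print("observed max:", y.max())
print("P(max(y_rep) >= max(y)):", np.mean(y_rep.max(axis=1) >= y.max()))
```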


Mixture

\[f_{\text{MixNB}}(y;\theta_1, \theta_2, p) = p f_{\text{NB}}(y; \theta_1) + (1-p) f_{\text{NB}}(y; \theta_2)\]
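As a quick illustration, here is a minimal sketch of evaluating \(f_{\text{MixNB}}\) numerically; encoding each \(\theta_k\) as a scipy-style \((n, p)\) pair and the specific parameter values are assumptions for the example, not values from the lecture.

```python
import numpy as np
from scipy.stats import nbinom

def mix_nb_pmf(y, theta1, theta2, p):
    """pmf of the two-component negative binomial mixture f_MixNB.
    Each theta is a (size, prob) pair in scipy's parameterization."""
    return p * nbinom.pmf(y, *theta1) + (1 - p) * nbinom.pmf(y, *theta2)

# Illustrative parameters: one component for small family sizes and a
# heavier-tailed component for large ones.
y = np.arange(10_000)
pmf = mix_nb_pmf(y, theta1=(2.0, 0.5), theta2=(5.0, 0.05), p=0.3)
assert np.isclose(pmf.sum(), 1.0)  # sanity check: the pmf sums to one
```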


Finite mixtures

Infinite mixtures

MFM versus BNP

Mixture models: a second application

Cluster indicators

Consider these two models (for simplicity, treat \(\theta = (\theta_1, \theta_2)\) and \(p\) as fixed for now): the mixture model, in which \(Y\) is drawn directly from the density \(p f(\cdot; \theta_1) + (1-p) f(\cdot; \theta_2)\), and the clustering model, in which a cluster indicator \(Z \in \{0, 1\}\) is drawn first, with \({\mathbb{P}}(Z = 0) = p\), and then \(Y | Z = z\) is drawn from \(f(\cdot; \theta_1)\) if \(z = 0\) and from \(f(\cdot; \theta_2)\) otherwise.

For concreteness, as a running example let \(f\) be Poisson, \(\theta = (1, 2)\), and \(p = 1/2\). Let \({\mathbb{P}}_\text{mix}\) and \({\mathbb{P}}_\text{clust}\) denote these two models.

Poll: compare \({\mathbb{P}}_\text{mix}(Y = 7)\) and \({\mathbb{P}}_\text{clust}(Y = 7)\)

  1. In both cases \(Y\) is marginally Poisson-distributed, so the two probabilities are equal
  2. The two probabilities are equal for other reasons
  3. \({\mathbb{P}}_\text{mix}(Y = 7) < {\mathbb{P}}_\text{clust}(Y = 7)\)
  4. \({\mathbb{P}}_\text{mix}(Y = 7) > {\mathbb{P}}_\text{clust}(Y = 7)\)

Cluster indicators

\[ \begin{align*} {\mathbb{P}}_\text{clust}(Y = 7) &= \sum_{z=0}^1 {\mathbb{P}}_\text{clust}(Z = z) {\mathbb{P}}_\text{clust}(Y = 7 | Z = z) \\ &= p f(7; \theta_1) + (1-p) f(7; \theta_2) \\ &= {\mathbb{P}}_\text{mix}(Y = 7) \end{align*} \]
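This identity is easy to verify numerically with the running example (\(f\) Poisson, \(\theta = (1, 2)\), \(p = 1/2\)): compute the marginal probability directly, and estimate the clustering representation by simulating the indicator \(Z\) explicitly. A minimal sketch:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
p, theta1, theta2, y = 0.5, 1.0, 2.0, 7

# P_mix(Y = 7): the marginal density, with Z summed out analytically.
p_mix = p * poisson.pmf(y, theta1) + (1 - p) * poisson.pmf(y, theta2)

# P_clust(Y = 7): simulate the indicator Z explicitly, then Y | Z.
z = rng.random(500_000) < p                      # True w.p. p -> theta1
samples = rng.poisson(np.where(z, theta1, theta2))
p_clust = (samples == y).mean()

print(p_mix, p_clust)  # the two estimates agree up to Monte Carlo error
```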

Terminology: we have two representations of the same model

Model based clustering


Summarization of the posterior and label switching

Summarization

Label switching


Poll: how to obtain a summarization method that is invariant to cluster relabelling?

  1. Use the mean of each \(z_i\)
  2. Use the median of each \(z_i\)
  3. Use the mode of each \(z_i\)
  4. Something else

Example of a summarization method that is invariant to cluster relabelling

The rand loss between two partitions \(\rho\) and \(\rho'\) counts the pairs \((i, j)\) on which the two partitions disagree about co-clustering, where \(i \sim_{\rho} j\) means that \(i\) and \(j\) belong to the same block of \(\rho\):

\[ \sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)]. \]
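Here is a minimal sketch of this loss in code, with partitions encoded as label vectors; the encoding and the example partitions are illustrative assumptions. Swapping the cluster labels leaves the loss unchanged, which is exactly the invariance we wanted.

```python
import numpy as np
from itertools import combinations

def rand_loss(rho, rho_prime):
    """Number of pairs (i, j) on which two partitions disagree about
    co-clustering. Partitions are encoded as label vectors; since only
    'same block or not' matters, the loss is invariant to relabelling."""
    rho, rho_prime = np.asarray(rho), np.asarray(rho_prime)
    return sum(
        (rho[i] == rho[j]) != (rho_prime[i] == rho_prime[j])
        for i, j in combinations(range(len(rho)), 2)
    )

# Relabelling the clusters (1 <-> 2) leaves the loss at zero:
assert rand_loss([1, 1, 2, 2], [2, 2, 1, 1]) == 0
print(rand_loss([1, 1, 2, 2], [1, 2, 1, 2]))  # 4 discordant pairs
```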

Goal: computing the Bayes estimator derived from the rand loss.

First, we can write:

\[ \begin{aligned} {\textrm{argmin}}_{\textrm{partition }\rho'} {\mathbf{E}}\left[{\textrm{rand}}(\rho, \rho')|X\right] &= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} {\mathbf{E}}\left[{{\bf 1}}\left[\rho_{ij} \neq \rho'_{ij}\right]|X\right] \\ &= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} \left\{(1-\rho'_{ij}){\mathbb{P}}(\rho_{ij} = 1|X) + \rho'_{ij} \left(1- {\mathbb{P}}(\rho_{ij} = 1 |X)\right)\right\} \end{aligned} \]

where \(\rho_{i,j} = (i \sim_{\rho} j)\), which can be viewed as edge indicators on a graph.

The above identity comes from the fact that \(\rho_{i,j}\) and \(\rho'_{i,j}\) are each either one or zero, so:

\[ {\mathbf{E}}\left[{{\bf 1}}\left[\rho_{ij} \neq \rho'_{ij}\right]|X\right] = (1-\rho'_{ij})\,{\mathbb{P}}(\rho_{ij} = 1|X) + \rho'_{ij}\,{\mathbb{P}}(\rho_{ij} = 0|X). \]

This means that computing an optimal bipartition of the data into two clusters can be done in two steps:

  1. Simulate a Markov chain and use its samples to estimate \({s}_{i,j} = {\mathbb{P}}(\rho_{ij} = 1 | X)\) via Monte Carlo averages.
  2. Minimize the linear objective function \(\sum_{i<j} \left\{(1-\rho'_{ij}){s}_{i,j} + \rho'_{ij} \left(1- {s}_{i,j}\right)\right\}\) over bipartitions \(\rho'\).

Note that the second step can be computed efficiently using max-flow/min-cut algorithms (understanding how these algorithms work is outside the scope of this lecture, but if you are curious, see CLRS, chapter 26).
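As a sketch of step 2, the following brute-force minimizer enumerates all labelings, which is only feasible for very small \(n\) (the max-flow/min-cut formulation is what scales to realistic data); the matrix `S` of posterior co-clustering probabilities is a made-up stand-in for the Monte Carlo averages produced in step 1.

```python
import numpy as np
from itertools import product

def best_bipartition(S):
    """Brute-force minimizer of sum_{i<j} (1 - rho'_ij) s_ij + rho'_ij (1 - s_ij)
    over binary labelings (including the single-cluster labeling, which the
    estimator is free to prefer). Exponential in n: a toy stand-in for min-cut."""
    n = S.shape[0]
    best, best_obj = None, np.inf
    for labels in product([0, 1], repeat=n):
        z = np.array(labels)
        obj = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                # Pay (1 - s_ij) for co-clustering, s_ij for separating.
                obj += (1 - S[i, j]) if z[i] == z[j] else S[i, j]
        if obj < best_obj:
            best, best_obj = z, obj
    return best, best_obj

# Hypothetical posterior co-clustering probabilities (Monte Carlo averages).
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(best_bipartition(S))  # labels [0, 0, 1, 1], objective 0.9
```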

Another pitfall

Readings