Mixture models: motivation
Setup: molecular barcodes in cancer cells
- patient-derived breast cancer samples
- CRISPR-Cas9 is used to:
- edit the genome of the cancer cell (about 200 genes targeted; we look at only 1 of them)
- many cells get the same edit, but we also embed a unique molecular identifier (the barcode) in individual cells
- the sample is then implanted in an immunodeficient mouse
- after a few weeks, the tumour has grown and is sequenced
Note: when a cell divides, the barcode gets copied. So at the end of the process, each “unique” barcode is no longer unique! The number of times we sequence each barcode then gives us an indication of the number of progenies the original CRISPR-edited cell gave rise to. Hence a notion of “family size”.

Data: the number of times we sequenced each barcode

Challenges
- we do not have a theory (yet) that provides a distribution family for the likelihood of the data
- as we will see, default choices, for example geometric, Poisson, or even negative binomial, do not work well (hint of the root cause: presence of extremes)
Poll: How to check if, say, a Poisson is a good choice?
- Posterior predictive checks
- Compute Bayes factors
- Leave one out probability integral transform
- Simulation based calibration
- None of the above
Mixture
- Solution: use mixtures of known distributions
- Example: MixNB, a mixture of two Negative Binomial distributions, each with a cluster-specific parameter vector \(\theta_i = (\theta_{i,1}, \theta_{i,2})\)
\[f_{\text{MixNB}}(y;\theta_1, \theta_2, p) = p f_{\text{NB}}(y; \theta_1) + (1-p) f_{\text{NB}}(y; \theta_2)\]

- Examples of priors on \(p\): uniform or beta.
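As a quick illustration (not part of the original notes), here is a minimal sketch of how the MixNB density could be evaluated with SciPy; the Negative Binomial parameterization \((n, q)\) and the numbers used are assumptions for the example only.

```python
# Minimal sketch (illustrative only): evaluating the MixNB density with SciPy.
# The NB parameterization theta_i = (n_i, q_i) (number of successes, success
# probability) is an assumption; any two-parameter NB version works the same way.
from scipy.stats import nbinom

def mix_nb_pmf(y, theta1, theta2, p):
    """p * NB(y; theta1) + (1 - p) * NB(y; theta2)."""
    return p * nbinom.pmf(y, *theta1) + (1 - p) * nbinom.pmf(y, *theta2)

# Made-up example: a "small family" and a "large family" cluster, mixed 50/50.
print(mix_nb_pmf(7, theta1=(2, 0.5), theta2=(2, 0.05), p=0.5))
```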
Cluster indicators
Consider these two models (for simplicity, consider \(\theta = (\theta_1, \theta_2)\) and \(p\) as fixed for now)
- The mixture we are familiar with
- \(y | \theta, p \sim p f(y; \theta_1) + (1-p) f(y; \theta_2)\)
- The two step process based on a cluster indicators \(z\)
- \(z | p \sim {\text{Bern}}(p)\)
- \(y | z, \theta \sim f(\cdot; \theta_z)\)
For concreteness, as an example let us say \(f\) is Poisson, \(\theta = (1, 2)\), and \(p = 1/2\). Let \({\mathbb{P}}_\text{mix}\) and \({\mathbb{P}}_\text{clust}\) denote these two models
Poll: compare \({\mathbb{P}}_\text{mix}(Y = 7)\) and \({\mathbb{P}}_\text{clust}(Y = 7)\)
- In both cases \(Y\) is marginally Poisson-distributed, so the two probabilities are equal
- The two probabilities are equal for other reasons
- \({\mathbb{P}}_\text{mix}(Y = 7) < {\mathbb{P}}_\text{clust}(Y = 7)\)
- \({\mathbb{P}}_\text{mix}(Y = 7) > {\mathbb{P}}_\text{clust}(Y = 7)\)
Cluster indicators
- The two models' probabilities are always equal (not just for \(Y=7\))
- The fact that the random variables are Poisson has nothing to do with it (do not confuse a sum, \(Y = Y_1 + Y_2\) with \(Y_i \sim \text{Poi}(\theta_i)\), with a mixture, \(Y = Y_1 {{\bf 1}}[Z = 0] + Y_2 {{\bf 1}}[Z = 1]\))
\[
\begin{align*}
{\mathbb{P}}_\text{clust}(Y = 7) &= \sum_{z=0}^1 {\mathbb{P}}_\text{clust}(Z = z) {\mathbb{P}}_\text{clust}(Y = 7 | Z = z) \\
&= p f(7; \theta_1) + (1-p) f(7; \theta_2) \\
&= {\mathbb{P}}_\text{mix}(Y = 7)
\end{align*}
\]
Terminology: we have two representations of the same model
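As a quick numerical sanity check (not part of the lecture notes), the Poisson example above can be verified by evaluating the mixture density directly and by simulating the two-step process with cluster indicators:

```python
# Sanity check (illustrative only) of P_mix(Y = 7) = P_clust(Y = 7)
# for the Poisson example with theta = (1, 2) and p = 1/2.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
theta, p = (1.0, 2.0), 0.5

# P_mix(Y = 7): evaluate the mixture density directly.
p_mix = p * poisson.pmf(7, theta[0]) + (1 - p) * poisson.pmf(7, theta[1])

# P_clust(Y = 7): simulate the two-step process with cluster indicators.
pick1 = rng.random(1_000_000) < p                      # component 1 with prob. p
y = rng.poisson(np.where(pick1, theta[0], theta[1]))
p_clust = np.mean(y == 7)

print(p_mix, p_clust)  # the two estimates agree up to Monte Carlo error
```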
Example of a summarization method that is invariant to cluster relabelling
- Use the Bayes estimator!
- To define a loss, instead of thinking of the clustering as a vector of integers, \(z\), view it as a partition (set of disjoint sets) of the \(n\) datapoints
- True partition \({\rho}\)
- Guessed partition \({{\rho'}}\)
- The rand loss function, denoted \({\textrm{rand}}({\rho}, {{\rho'}})\), is defined as
\[
\sum_{1\le i < j \le n} {{\bf 1}}[(i \sim_{\rho}j) \neq (i \sim_{{\rho'}}j)].
\]
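In words, the rand loss counts the pairs of points on whose co-clustering the two partitions disagree; it is therefore invariant to relabelling the clusters. Here is a minimal sketch (not from the notes) computing it from cluster label vectors:

```python
# Minimal sketch (illustrative only): rand loss between two clusterings,
# each given as a vector of cluster labels, counting pairwise disagreements.
import numpy as np

def rand_loss(labels_true, labels_guess):
    labels_true, labels_guess = np.asarray(labels_true), np.asarray(labels_guess)
    n = len(labels_true)
    loss = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = labels_true[i] == labels_true[j]
            same_guess = labels_guess[i] == labels_guess[j]
            loss += int(same_true != same_guess)
    return loss

# Relabelling the clusters does not change the loss:
print(rand_loss([0, 0, 1, 1], [1, 1, 0, 0]))  # 0
print(rand_loss([0, 0, 1, 1], [0, 1, 1, 1]))  # 3
```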

Goal: computing the Bayes estimator derived from the rand loss.
First, we can write:
\[
\begin{aligned}
{\textrm{argmin}}_{\textrm{partition }\rho'} {\mathbf{E}}\left[{\textrm{rand}}(\rho, \rho')|X\right] &=
{\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} {\mathbf{E}}\left[{{\bf 1}}\left[\rho_{ij} \neq \rho'_{ij}\right]|X\right] \\
&= {\textrm{argmin}}_{\textrm{partition }\rho'} \sum_{i<j} \left\{(1-\rho'_{ij}){\mathbb{P}}(\rho_{ij} = 1|X) + \rho'_{ij} \left(1- {\mathbb{P}}(\rho_{ij} = 1 |X)\right)\right\}
\end{aligned}
\]
where \(\rho_{ij} = (i \sim_{\rho} j)\), which can be viewed as edge indicators on a graph.
The above identity comes from the fact that \(\rho_{i,j}\) is either one or zero, so:
- the first term in the brackets of the above equation corresponds to the edges not in the guessed partition \(\rho'\) (for which we are penalized if the posterior probability of the edge is large), and
- the second term in the same brackets corresponds to the edges in the guessed partition \(\rho'\) (for which we are penalized if the posterior probability of the edge is small).
This means that computing an optimal bipartition of the data into two clusters can be done in two steps:
- Simulate a Markov chain, and use the samples to estimate \({s}_{i,j} = {\mathbb{P}}(\rho_{ij} = 1 | X)\) via Monte Carlo averages.
- Minimize the linear objective function \(\sum_{i<j} \left\{(1-\rho'_{ij}){s}_{i,j} + \rho'_{ij} \left(1- {s}_{i,j}\right)\right\}\) over bipartitions \(\rho'\).
Note that the second step can be efficiently computed using max-flow/min-cut algorithms (understanding how these algorithms work is outside the scope of this lecture, but if you are curious, see CLRS, chapter 26).
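Putting the two steps together, here is a rough sketch (not from the notes): `z_samples` is an assumed name for an (S, n) array of cluster-indicator samples from an MCMC run, and, since max-flow/min-cut is out of scope here, step 2 is brute-forced over bipartitions, which is only feasible for small n.

```python
# Rough sketch (illustrative only). Assumes `z_samples` is an (S, n) array of
# posterior cluster-indicator samples; brute-forces step 2 instead of using a
# max-flow/min-cut solver, so it only scales to small n.
import itertools
import numpy as np

def coclustering_probs(z_samples):
    """Step 1: Monte Carlo estimates of s_ij = P(rho_ij = 1 | data)."""
    z = np.asarray(z_samples)
    return np.mean(z[:, :, None] == z[:, None, :], axis=0)

def bayes_estimator_bipartition(z_samples):
    """Step 2: minimize the linear objective over all bipartitions."""
    s = coclustering_probs(z_samples)
    n = s.shape[0]
    iu = np.triu_indices(n, k=1)                      # pairs i < j
    best_labels, best_loss = None, np.inf
    for labels in itertools.product([0, 1], repeat=n):
        labels = np.array(labels)
        rho = (labels[:, None] == labels[None, :]).astype(float)
        loss = np.sum((1 - rho[iu]) * s[iu] + rho[iu] * (1 - s[iu]))
        if loss < best_loss:
            best_labels, best_loss = labels, loss
    return best_labels, best_loss
```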