Lecture 10: Jump processes: asymptotic behavior and applications

05 Feb 2014

Instructor: Alexandre Bouchard-Côté

Editor: Yumi Kondo

Jump processes, continued

Review from last time: computing the matrix exponential.

Why? A tool needed for MCMC approximation to the posterior over $Q$ given partial observations.

What is the interpretation? An end-point conditioned probability, $P_t = \exp(tQ) = \P(X_t = y | X_0 = x)$, marginalizing over all possible paths. Result organized into a matrix where rows are $x$ indices and columns are $y$ indices.

What is the definition? Same Taylor series as standard exponential, with a matrix argument instead.

\begin{eqnarray} \exp(M) &=& \sum_{n=0}^\infty \frac{M^n}{n!}. \end{eqnarray}

How to compute it in practice? Suppose $Q$ an be diagonalized, $A = U D U^{-1}$, $D = \textrm{diag}(\lambda_i)$.

Write:

\begin{eqnarray}\label{eq:calc-matrix-exp} \exp(tQ) &=& U U^{-1} + U (tD) U^{-1} + \frac{1}{2!} U(tD) U^{-1} U (tD) U^{-1} + \dots \\ &=& U \left(I + (tD) + \frac{1}{2!} (tD)^2 + \dots\right) U^{-1} \\ &=& U \textrm{diag}(\exp(t \lambda_i)) U^{-1} \end{eqnarray}

Note: You may also encounter cases where $Q$ is not diagonalizable (in particular, when the CTMC is not reversible, a notion to be defined shortly). In these cases, there are many other methods, but each has it strengths and weaknesses. Moler and Van Loan, 2003 describes 19 ways, from which one of the most useful in the non-diagonalizable case is the squaring and scaling method combined with Pade approximations. This is what is used in BLAS (which can be accessed in java via JBLAS).

Note that the squaring and scaling is a simple, generic trick that makes matrix exponential calculation more numerically stable. It is based on the following identity, valid for all integer $m$:

\begin{eqnarray} \exp(s A) = \left(\exp(s A / m)\right)^m. \end{eqnarray}

Stationary distribution of a CTMC

All the stationarity concepts from discrete time Markov chain carry to the CTMC setup by looking at the skeleton of the CTMC:

Definition: the skeleton (or $\delta$ skeleton) of a CTMC $X_t, t\in[0, \infty)$ is the following sequence of marginals: $X_0, X_{\delta}, X_{2\delta}, \dots$. In the following discussion, the value $\delta$ takes will not matter.

Note: we can easily compute the transition probability of this skeleton via matrix exponentiation!

Consequence: One simple way of getting the stationary distribution:

\begin{eqnarray} \lim_{t\to\infty} \P(X_t = y | X_0 = x) &=& \lim_{n \to \infty} \P(X_{n\delta} = x | X_0 = y) \\ &=& \lim_{n\to\infty} ((P_\delta)^n)_{x,y}, \end{eqnarray}

which can be computed by doing a standard eigendecomposition of $P_\delta$.

A more direct way: let us look at the stationary equation for the skeleton, and take derivatives with respect to $\delta$ on both sides:

\begin{eqnarray} \frac{\ud}{\ud \delta} \pi &=& \pi \frac{\ud}{\ud \delta} \exp(\delta Q) \Longleftrightarrow \\ 0 &=& \pi Q \exp(\delta Q) \Longleftrightarrow \\ 0 &=& \pi Q, \end{eqnarray}

where we used:

  • between the first and second lines: the fact that derivatives of the matrix exponentials with respect to the scalar $t$ behave like derivatives of scalar exponential (see wikipedia page),
  • between the second and third lines: the fact that the result of matrix exponentiation is positive definite hence invertible (this follows from Equation~(\ref{eq:calc-matrix-exp}) and the characterization of positive definiteness in terms of positive eigenvalues).

Reversibility: the stronger notion of reversibility also applies to CTMCs, again via the skeleton,

\begin{eqnarray} \pi_i \left(P_\delta\right)_{i,j} = \pi_j \left(P_\delta\right)_{j,i}. \end{eqnarray}

As before, we can get a more direct criterion by taking derivatives on both sides:

\begin{eqnarray} \pi_i \left(Q \exp(Q\delta)\right)_{i,j} = \pi_j \left(Q \exp(Q\delta)\right)_{j,i}, \end{eqnarray}

and setting $\delta = 0$ yields:

\begin{eqnarray} \pi_i Q_{i,j} = \pi_j Q_{j,i}. \end{eqnarray}

In fact, as shown in Tavaré, 1986, it is possible to write the rate matrix of any time reversible $K$ state CTMC as $q_{i,j} = \pi_j \theta_{\{i,j\}}$, where $\theta_{\{i,j\}}$ are $\binom{K}{2}$ non-negative parameters.

Uniformization and maximum likelihood with partial observation

Recall the expression of the fully observed likelihood from last time:

\begin{eqnarray} \mu(x_1) \prod_x e^{- q_{x,x} T(x)} \prod_{x' \neq x} q_{x, x'}^{N(x, x')}. \end{eqnarray}

Let us denote this function as $f_q(n,h)$, using $h$ for holding time instead of $T$, and $n$ for transition counts.

The approach we will take is an EM approach:

\begin{eqnarray}\label{eq:em} q^{(t)} \gets \textrm{argmax}_q \E_{q^{(t-1)}}[\log f_q(N, H) | Y], \end{eqnarray}

where $N$ and $H$ are the random vectors counting the number of transitions and total holding times. Note that $\log f$ is linear in $n$, $h$, therefore Equation~(\ref{eq:em}) reduces to:

\begin{eqnarray} q^{(t)} \gets \textrm{argmax}_q \log f_q(\E_{q^{(t-1)}}[N|Y], \E_{q^{(t-1)}}[H|Y]). \end{eqnarray}

We therefore need to compute the posterior expected transition counts and holding times in the interval $[0, t]$.

This can be done in several ways:

  • Exact, deterministic algorithms exist, but are fairly complicated and can be slow in some cases. See this overview by Tataru and Hobolth, 2011.
  • Approximate methods where paths are sampled given the two end-points. See this overview by Hobolth and Stone, 2009. In particular, we will talk here about those based on uniformization.

Uniformization:

First: note that if all off-diagonal entries in the rate matrix were all equal, this would be easy. (Why?)

General case: suppose we have an arbitrary matrix now. Uniformization works as follows:

  1. Let $\Omega$ denote the maximum diagonal element in absolute value, $\Omega = \max_i |q_{i,i}|$.
  2. Simulate a set of points in the interval $[0, t]$ according to a poisson process with uniform intensity $\Omega$ on $[0, t]$. Denote the number of points hence sampled by $n$.
  3. Create a new matrix $B = I + \frac{1}{\Omega} Q$.
    • Note: this is a transition matrix!
    • Note: it has self-transitions.
  4. Simulate $n$ transition steps of a discrete time Markov chain starting at $x$ and ending at $y$ (how?).
  5. Use the location of the points of the poisson process to transform the above discrete process into a continuous one.

Proposition: this procedure simulates from the posterior over paths given the end-points.


Proof sketch: Let $\pi_t$ denote the marginal at time $t$ given an initial distribution $\mu$. We have:

\begin{eqnarray} \pi_t &=& \exp(Qt) \mu \\ &=& \exp(\Omega (B - I)t) \mu \\ &=& \exp(- \Omega t) \exp(\Omega t B) \mu \\ &=& \sum_{n=1}^\infty \underbrace{\left( \exp(-\Omega t) \frac{(\Omega t)^n}{n!} \right)}_{\textrm{Poisson part}} \underbrace{(B^n \mu)}_{\textrm{Factor graph part}} \end{eqnarray}


Application: phylogenetics.

Example of data: related nucleotide sequences $Y$ for $n > 2$ species.

Statistical tasks:

  • Find a phylogenetic tree explaining the evolution of these species (more on that below).
  • Reconstruct in silico the sequences of extinct ancestors (from this, it is even possible to to in vitro reconstruction of ancestral proteins).
  • Compare and/or estimate models of evolution (as we will see, a rate matrix is a major component of evolutionary models).
  • Assess our uncertainty on all of the above.

Phylogenetic tree: a data structure representing a model of evolution. It contains a topology (discrete tree-shaped graph, which we assume to be directed for now), a set of branch lengths (numbers greater than zero, one associated to each edge in the graph).

Terminology: Some background terminology on phylogenies:

  • the most common recent ancestor is called the root of the phylogeny;
  • view the edges of the graphs as line segments of length specified by the branch lengths;
  • each point of the tree represents a certain generation of a specie;
  • points at the leaves are modern species;
  • junction points usually represent speciation events, they are called internal nodes;
  • trees come in several flavors:
    1. clock or ultrametric trees, which are trees such that for all fixed internal node $v$, the path length from $v$ to any descendent leaf is identical;
    2. non-clock tree, where we do not impose the above constraint.

Note:

  • For clock trees, finding the root is trivial.
  • For non-clock, it is more difficult. In particular, it is unidentifiable for reversible evolutionary models (exercise). For this this reason, non-clock trees are often represented in an unrooted fashion, where the directionality of the edges is dropped, and where statistical methods do not attempt to place the root.

Last definition: a clade $c$ is a division (bi-partition) of the leaves obtained by cutting an edge. We write $c\in t$ when $t$ admits clade $c$.

Tree model (prior): with density $p(t)$, which can be constructed as follows.

  • A simple choice in the non-clock case is to pick the topology uniformly, and the branch lengths to be independent exponential random variables
  • unclock case example: the coalescent (see for example the background Section 2 of Gorur and Teh, 2009). Note that the coalescent also provides a simple method for doing forward simulation of topologies that are uniformly distributed.

Evolutionary model: denoted $\ell(y | t)$.

  • Assume first that there is a single nucleotide in each sequence. This is clearly unrealistic, but it will be easy to relax this later.
  • Assume for now that a 4x4 rate matrix $Q$ has been specified.
  • Generative process:

    1. Generate the nucleotide at the root according to the stationary distribution $\pi$ of the rate matrix $Q$.
    2. Traverse the tree in preorder. At each node $v$, simulate a CTMC for each edge going out of $v$. These are all independent conditionally on their initial value, which is the value of the nucleotide at node $v$ for all branches. Note that the CTMC along each edge is simulated for a time given by the branch length of the edge.
    3. Discard everything generated except for the characters at the leaves.

    Note: more than one scenario (and in fact, a continuum of scenarios) can generate the same observations!

Supplementary references and notes

Under construction