Lecture 11: Diffusions

11 Feb 2014

Instructor: Alexandre Bouchard-Côté

Editor: Wooyong Lee and Huan Qi

Wrapping up phylogenetics

Loss function: if the goal is to reconstruct the tree, a standard loss function consists in counting the number of clades that disagree when comparing the true to the reconstructed one:

\begin{eqnarray} \sum_{c \subset \{1, 2, \dots, n\}} \1[\1(c \in t) = \1(c \in t')], \end{eqnarray}

where we use the fact that the clades $c$ correspond to subsets of the set of leaves $\{1, 2, \dots, n\}$.

Bayes estimator: for the loss function given above, the Bayes estimator consists in returning the tree consisting in the clades $c$ with posterior probability greater than 1/2, $\{c : \P(c \in T | Y) > 1/2\}$. It can be shown that there always exists a tree $t^*$ that contains all the consensus clades (see Semple and Steel, 2003).

MCMC proposal: we can easily build an MCMC chain that explores the space of trees. We only need the following two proposals to have an irreducible chain (exercise):

propose a new branch, for example from the exponential prior,
propose a small perturbation on the tree (picture).

Note: in practice, a combination of other proposals are used to accelerate mixing, see for example Lakner et al., 2008. Note that special proposals are also needed for ultrametric tree inference Drummond et al., 2012.

MH ratio calculation: once a new tree $t'$ has been proposed by some proposal $q$, we need to do an accept/reject step. The ratio is given by:

\begin{eqnarray} \frac{p(t')}{p(t)} \frac{\ell(y | t')}{\ell(y | t)} \frac{q(t | t')}{q(t' | t)}. \end{eqnarray}

While $p(\cdot)$ and $q(\cdot | \cdot)$ are typically easy (consider for example the proposals 1 and 2 above), $\ell(y | \cdot)$ needs some thoughts.

Exercise: show how to efficiently and exactly compute $\ell(y | \cdot)$.

Once you have established how to compute $\ell(y | \cdot)$ in the basic model outlined above, see if you can extend your method to more advanced phylogenetic models where:

each site is allowed to evolve at a different (iid latent, discretized) rate multiplying all the rate matrices at that site,
same as above, but where the latent rate categories are Markov,
codon models: where correlations between triplets of coding nucleotides are taken into account
covarion models, where the rate can change along the tree

Diffusions: overview (not comprehensive) and some pointers

Diffusion: a continuous time, continuous space stochastic process with continuous paths.

Review: Brownian motion

(see Peres's lecture note as well as textbook for more details)

Definition: A collection of random variables ${ Y_s(\omega)\in\RR ~;~ s\in [0,\infty) }$ such that:

For all $s,t \in [0,\infty)$, the increment over $(s,t]$ is normally distributed: \begin{eqnarray} (Y_{s+t} - Y_s) \sim N(0,\sigma^2 t). \end{eqnarray}
For any two disjoint intervals, $(t_1, t_2]$ and $(t_3, t_4]$, the increments $(Y_{t_2} - Y_{t_1})$ and $(Y_{t_4} - Y_{t_3})$ are independent.
The process starts at zero ($Y_0 = 0$) and its paths are continuous functions of $s$.

Sometimes, people use the term Brownian motion to point the special case of the above definition where $\sigma^2 = 1$.

Practical question: How to sample from a Brownian motion? Example: using Monte Carlo, approximate the probability that a Brownian motion crosses the line $y=+1$ at least once in the interval $[0, 1/2]$.

Important idea: Lazy computing. Often, the question we try to answer does not require the knowledge of all points $Y_s$ (or at least, can be approximated using a finite number of points). Only instantiate the value at those points $s_i$ that matter.

Note: we may not know a priori what points matter. For example, with the above question may want to dynamically refine the approximation in a region very close to the line $y = +1$.

Refining the Brownian motion FDDs: Let's say we have only sampled $Y_{0.1}$ and $Y_{0.3}$. We want to add a sample $Y_{0.2}$ after the fact. How can we do this?

Note: it can be shown that by characterization 1-3 above, $(Y_{0.1}, Y_{0.2}, Y_{0.3})$ has to be Multivariate Normal.
Mean: $(0,0,0)$.
Covariance: say $0 < s < t$, we have: \begin{eqnarray} \Cov(Y_s, Y_t) & = & \E[(Y_s-0)(Y_t-0)] \\ & = & \E[Y_s (Y_t - Y_s + Y_s)] \\ & = & \E[(Y_s)^2] + \E[Y_s - Y_0] \E[Y_t - Y_s] \\ & = & \sigma^2 s + 0. \end{eqnarray}
Therefore, we can sample from the conditional distribution, $Y_{0.2} | (Y_{0.1}, Y_{0.3})$, which is also Normal—see Rasmussen's book on Gaussian processes, available online here.

Kalman filter

Suppose that a Brownian motion (possibly multivariate) is not observed directly, but rather with normal distributed errors.

\begin{eqnarray} Z_t &\sim& \textrm{BM}(\sigma_1^2) \\ Y_{t_i} | Z_{t_i} &\sim& \Norm(Z_{t_i}, \sigma_2^2), \end{eqnarray}

where $\sigma_1^2, \sigma_2^2$ are parameters, considered fixed for now.

Our goal is to represent the posterior over $n$ latent FDDs (potentially jointly with other queries) given the observations.

Since each component of the model is normal, the model is multivariate normal, so in theory this could be done with a multivariate conditional normal calculation via the Shur complement. The major drawback of this method is that it requires inverting an $n$ by $n$ matrix, a costly operation.

Instead, one can use the factor graph framework, where each factor encodes a normal distribution, plus a normalization. The normal is univariate for unary factors, and bivariate for binary factors.

As marginalizing a multivariate normal random variable gives another multivariate normal random variable, we get that the variable elimination steps will produce a factor of the assumed type. Similarly, pointwise product of normal densities will also produce an unnormalized normal density.

The factor graph framework can therefore be used to reduce the computation to a cost linear in $n$, a considerable gain in applications where we may encounter long time series.

Note that this can be used on a tree as well, for example for allele drifts among tree-related populations (see Felsenstein (2004), chapter 23 for an accessible exposition).

Extension: Gaussian Process

Motivation: going from random function with $[0,\infty) \to \RR$ realization to random function with $\Xscr \to \RR$, where assumptions on $\Xscr$ are weak (more below).

Examples:

spatial statistics (prior over temperature field),
meta-modelling (performance on cross-validation of a black-box for various hyper-parameter settings),
general regression problems,
classification via a GLM construction.

Construction:

We want the FDDs $(X_{s_1}, \dots, X_{s_n}, s_i \in \Xscr)$ to be multivariate normal, let us say with mean zero.
All we need is therefore a way of getting a covariance matrix from any finite list of locations ${s_1}, \dots, {s_n}, s_i \in \Xscr$.
Continuity requirement, combined with the theory of Reproducing kernel Hilbert spaces tell us that the general way of getting covariance matrices in (2) is as follows:
- construct a suitable function $k$ on pairs of points, called a covariance function $k : \Xscr \times \Xscr \to [0, \infty)$ (important requirements on that function to be specified shortly).
- build the covariance matrix by evaluating $k$ on all pairs $K = (k(s_i, s_j))_{i,j}$.

Covariance function: not all functions $k$ will allow us to go through the above process. The key constraint is that we should guarantee that we get a non-negative matrix $K$ no matter what points ${s_1}, \dots, {s_n}, s_i \in \Xscr$ we use as inputs. This means:

Clearly, the function should always be symmetric.
Having all the entries positive is sufficient, but not necessary.
The general condition is non-negative definiteness ($\int \int k\; \ud \mu \times \ud \mu \ge 0$ for all $\mu$). Several equivalent characterizations exist. For example, that $k$ should be a dot product in an augmented space.

Examples:

Squared exponential: $\exp(- r^2 / 2\ell^2)$, for $r = |x - x'|$,
Matern:

\begin{eqnarray} \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}r}{\ell}\right)^\nu K_{\nu}\left(\frac{\sqrt{2\nu}r}{\ell}\right), \end{eqnarray}

where $K_\nu$ is the modified Bessel function of second kind of order $\nu$, and $\ell$ is the characteristic length scale.(see Rasmussen's tutorial for some brief introduction)

See Rasmussen and Williams (2006), chapter 4 for more characterizations and examples.

Notes:

The diagonal value of the covariance function $(k(s,s))$ can be interpreted as observation noise.
The case $k(s,s) = 0$ is permitted (and used extensively in meta-modelling applications)
The other values of $k$ can be interpreted as marginal covariances of pairs of points.

Computations:

A large field exists on alleviating the cubic computational burden. See these references. General overview:

One way to reduce the cost is to have a sparse precision matrix (inverse of the covariance matrix). In fact, with enough zeros, it becomes possible to use the factor graph algorithm from the previous section to get an exact answer.
Factor graph algorithms are only exact on trees, but they can be generalized to arbitrary graphs (and covariance matrices). See the references on EP in the above resources.
The above approach falls under the framework of variational inference. Another recent, popular GP approximation approach is that of Rue, Martino and Chopin, 2009.
Another popular approach is the Nystrom approximation, see for example Quinonero-Candela and Rasmussen, 2005.

Other diffusions: SDEs

Let us go back to random functions from $[0,\infty)$ to $\RR$. Here is a general recipe for reating new diffusion from a Brownian motion $B_t$ (assume $\sigma^2 = 1$):

Pick a discretization $0 < t_0 < t_1 < \dots < t_m < t$ with $(\Delta t)_{k} = t_{k+1} - t_{k} < M$.
Set $Q(0) = 0$
For the following $t_k$, set $Q_k = Q(t_k)$ as follows:

\begin{eqnarray} (\Delta Q)_k = Q_{k+1} - Q_k = \alpha(t_k, Q_k) \times (\Delta t)_k + \sigma(t_k, Q_k) \times (\Delta B)_k, \end{eqnarray}

where $(\Delta B)_k = B_{k+1} - B_k$ (and is therefore normal), and $\alpha$ and $\sigma$ are parameters (assumed fixed for now).

If we let $M$ go to zero, it can be shown under certain conditions on $\alpha$ and $\sigma$ (see for example Ghosh (2010)), these processes converge to a continuous diffusion process, which is a solution to the following stochastic differential equation:

\begin{eqnarray} \ud Q(t) = \alpha(t, Q(t)) \ud t + \sigma(t, Q(t)) \ud B(t). \end{eqnarray}

Since $\alpha$ and $\sigma$ can depend on both $t$ and $Q$, a wide range of processes can be created this way, with applications in population genetics, physics, finance, ecology, etc.

Simulation of SDEs

The above generative process is less satisfactory than those we have seen so far. In all previous cases, we had algorithms (procedures that terminate in finite numbers of steps) to generate dataset from the prior. Posterior simulation was often based on ideas from prior simulation.

Here the we do not have an algorithm yet, since we would need an infinite number of steps to carry $M \to 0$.

In this section we give an overview of a recently developed algorithm, Beskos and Roberts (2005), that simulate FDDs from a wide range of SDEs in a finite (but random) number of steps.

Background:

Rejection sampling
Ito's formula
Girsanov transformation

How the algorithm works:

Get rid of $\sigma$ via a first application of Ito (i.e. simulate from a different process which has $\sigma = 1$, and such that paths from the desired process can be obtained via a known, deterministic mapping).
"Lazily" propose a path according to a Brownian bridge (Brownian motion with two fixed end points), where the second end point is random but with computable density.
Accept-reject:
- Theoretical formula given using Girsanov, Ito
- The intuition is that we cannot compute exactly the formula (would require simulating the full BM path), but can bound it with BM FDDs (but see paper for a more elegant scheme and complete explanations). First sample the uniform, then refine the BM FDDs until we are sure of the outcome of the accept/reject step.

More detail and background below:

Rejection sampling:

Simplest version: we want samples from $\mu$ with density $f$ (note: we assume all densities are normalized in this discussion), say on $[0,1]$ and such that we know its maximum, $1/\epsilon$
1. sample a proposal $Y$
2. sample a Bernoulli $\epsilon f(Y)$, if true, accept $Y$ (it is then an exact sample from $\mu$, c.f. importance sampling/MCMC)
3. else, start again, discarding $Y$ (note that discarding is different than MCMC rejection, we do not keep it in our MC averages at all here)
More general version: we can propose from some arbitrary proposal $\nu$, as long as its density $g$ has a support no smaller than that of $f$. The Bernoulli acceptance is then:

\begin{eqnarray} \epsilon \frac{f(Y)}{g(Y)}. \end{eqnarray}

Finally, we do not actually need $\mu, \nu$ to be measures over the real, all we need is that $\mu$ is absolutely continuous with respect to $\nu$ (a generalization of the density support assumption used in the previous version of rejection sampling), which gives us an acceptance probability in terms of a Radon-Lebesgue-Nikodym density (generalization of density ratios):

\begin{eqnarray} \epsilon \frac{\ud \mu}{\ud \nu}(Y) \end{eqnarray}

The last version is useful because here the measures will be over diffusions.

Ito formula:

Suppose we have an SDE $X_t$, and we transform it using a twice continuously differentiable function $f(t, x)$. Recall that we will use this for example to set $\sigma = 1$ without loss of generality.

Ito's lemma states that the new process, $Y_t = f(t, X_t)$, is also a SDE, and that the new parameters $\tilde \alpha$ and $\tilde \sigma$ have a closed form expression in terms of the original parameters and of the first and second partial derivatives of $f$:

\begin{eqnarray} \ud Y(t) &= &\tilde \alpha(t, Y(t)) \ud t + \tilde \sigma(t, Y(t)) \ud B(t), \\ \tilde \alpha &=& \frac{\partial f}{\partial t} + \alpha \frac{\partial f}{\partial x} + \frac{1}{2} \sigma^2 \frac{\partial^2 f}{\partial x^2} \\ \tilde \sigma &=& \sigma \frac{\partial f}{\partial x}. \end{eqnarray}

Therefore, the original process can be transformed into a new one defined as:

\begin{eqnarray} Y_t &=& f(t, X_t) \\ f(t, x) &=& \int_0^x \frac{1}{\sigma(u)} \ud u, \end{eqnarray}

such that $Y_t$ now has unit $\tilde \sigma$ by Ito's formula, and a transformed but known $\tilde \alpha$.

Girsanov theorem: under square integrability conditions, Girsanov theorem combined with Ito's lemma gives us an expression for the Radon-Lebesgue-Nikodym density $G(B)$ of a SDE on $[0, T]$ with respect to a standard browian motion:

\begin{eqnarray} G(B) = \exp\left\{ A(B_T) - A(B_0) - \frac{1}{2} \int_0^T \left( \alpha^2(B_t) + \alpha'(B_t) \right) \ud t \right\}, \end{eqnarray}

where $A(u) = \int_0^u \alpha(y) \ud y$.

Extending this idea a bit further by picking an appropriately biased Brownian motion (obtained by sampling the end point according to a density $h(u) \propto \exp(A(u) - u^2/2T)$, and then sampling a brownian bridge from zero to this endpoint), we get a simplified Radon-Lebesgue-Nikodym density:

\begin{eqnarray} G(B) \propto \exp\left\{ - \frac{1}{2} \int_0^T \left( \alpha^2(B_t) + \alpha'(B_t) \right) \ud t \right\}. \end{eqnarray}

Let $\phi(u) = \frac{1}{2} \alpha^2(u) + \frac{1}{2} \alpha'(u) - \textrm{constant}$. The acceptance is given by a density proportional to :

\begin{eqnarray} \exp\left\{ - \int_0^T \phi(B_t) \ud t \right\}. \end{eqnarray}

Note that this still contains an infinite dimensional object, but now the integral is just a standard integral (no Ito integration required), and it is now easier to make assumptions on $\alpha$ so as to make $\phi$ bounded. From this, it is possible to build a scheme that refines our realization of $B_t$ until we are sure we need to reject or accept. See Beskos and Roberts (2005) for details.

Supplementary references and notes

Under construction