Lecture 9: Point and jump processes

03 Feb 2014

Instructor: Alexandre Bouchard-Côté

Editor: Tyler Woodbury

Point processes, continued

Application 2: Creating a Bayesian non-parametric prior over infinite latent features

Motivation: modelling customer choices via latent features.

Given a finite set of types of products (e.g. laptop, cellphone, etc.) and a finite number of choices for each, you ask people questions of the form 'you you buy an android or an iphone?'.

What we can do so far: cluster people, e.g. 'thrifty people' vs. 'extravagant'. Problem: one can imagine other useful clustering that are overlapping, e.g. 'people in their 0-19', '20-29', '30-39', '40+'. The problem is that we did not take note or have access to these (e.g. because of privacy). These are then called latent features.

Idea: instead of assigning each datapoint (person interviewed) exactly one $\theta$, let us assign them a set of thetas, $\{\theta\}$.

Given the $\{\theta\}$ we can easily build a parametric model over choices (e.g. with GLM machinery).

Matrix picture.

Beta process: Instead of an infinite list of sticks that sum to one, we want an infinite list of stick each between zero and one, but that do not need to sum to one. (can you see the difference with the gamma process?)

Then each customer flips a coin for each stick, with the bias given by the height of the stick. This specifies whether the customer picks the corresponding dish $\theta$. We want the probabilities to decay fast enough to guarantee that only a finite number of dishes are picked by each customer (but not too fast, so that the number of dishes used across all customers always grows).

It turns out it is possible to construct such random list by simply modifying the Levy measure into:

\begin{eqnarray} y^{-1} (1-y)^{c} p(x), \end{eqnarray}

where $c$ is a positive constant hyper-parameter.

With this choice, much of the machinery we have developed in the DP case can be also developed for this new process. See Miller 2011 for detail, here we will only cover the highlight:

Indian Buffet Process (IBP): an exchangeable distribution over the sampled sets $\underline{\{\theta_i\}}$, which is to the beta process what the CPR is to the DP:

The first customer picks a Poisson distributed number of dishes, with parameter $c$.
Customer $n+1$ does the following:
- Look at each dish that was picked more than once before. For each one, if $i \ge 1$ is the number of times it has been taken before, take it as well independently with probability $i/(c+n)$.
- Pick a Poisson distributed number of new dish, with rate parameter $c/n$.

Application 3: Other normalized completely random measures

Interestingly, the PY process cannot be constructed this way (by normalizing a CRM). However, other normalized CRMs offer interesting generalization to Dirichlet processes.

An important example is the Generalized Gamma CRM, with Levy measure given by:

\begin{eqnarray} \frac{a}{\Gamma(1-\sigma)} y^{-\sigma - 1} e^{-\tau y} p(x), \end{eqnarray}

where $a > 0$, $\sigma \in (0,1)$ and $\tau \ge 0$ are hyper-parameters.

See Favaro and Teh, 2013.

Note: Research project examples.

Urban statistics and MGP.
What happens if you multiply a Gamma r.v. with a PY?

Jump processes

Notation: Let $S$ denote a finite set of states. In this lecture, we will study processes $Y_t$ indexed on the real line $t \in [0, \infty)$, and taking values in $S$.

Assumption: We will be making the following Markov assumption: for all states $x_n$ and times $t_1 < t_2 < \dots < t_n$, we assume that

\begin{eqnarray} \P(X_{t_n} = x_n | X_{t_1} = x_1, \dots, X_{t_{n-1}} = x_{n-1}) = \P(X_{t_n} = x_n | X_{t_{n-1}} = x_{n-1}). \end{eqnarray}

Examples/applications:

DNA example from first assignment
Survival analysis
Queueing theory
Switching state/anomaly detection

Note: CTMC naturally handle all types of observation (full, partial/censored, unequally spaced, etc).

Recall: simulation from a CTMC (lecture 1):

Initialize $y = y_0$, $p = ()$ to an empty list.
Repeat:
1. Simulate a waiting time: use an exponential distribution with the rate corresponding to the current state, $\Delta t \sim \Exp(\lambda_y)$.
2. Add a new segment to the piecewise constant path: $p \gets p \circ (y, \Delta t)$, where $\circ$ denotes concatenation (i.e. we represent a path with a list of pair, where each pair has the length of a segment, $\Delta t$, and its state $y$).
3. Simulate a transition: use a categorical distribution with the probabilities given by row $y$ of the transition matrix, $y' \sim \Cat(M(y,\cdot))$.
4. Update the current state: $y \gets y'$

Why Markov assumption implies we can use this simple generative process

Short answer: Memorylessness.

Recall: A random variable $T$ is memoryless if

\begin{eqnarray} \P(T > t + s | T > s) = \P(T > t), \end{eqnarray}

or equivalently:

\begin{eqnarray} \P(T > t + s) = \P(T > t) \P(T > s). \end{eqnarray}

Recall: the exponential distribution is the only continuous memoryless distribution. In other words, $P_t = \P(T > t)$ is continuous with $P_0 = 1$ and $P_{t+s} = P_t P_s$, iff $P_t = e^{-\lambda x}.$

Proposition: homogeneous, finite state Markov chains can be simulated using the algorithm from lecture 1.

Proof: Suppose $X_0 = x$, and denote the time to the first jump by $T_x$. It is enough to show that $T_x$ is memoryless:

\begin{eqnarray} \P(T_x > s + t | T_x > s) &=& \P(X_r = x \textrm{ for }r\in[0, s+t] | X_r = x \textrm{ for }r \in [0,s]) \\ &=& \P(X_r = x \textrm{ for }r\in[s, s+t] | X_r = x \textrm{ for }r \in [0,s]) \\ &=& \P(X_r = x \textrm{ for }r\in[s, s+t] | X_s = x ) \textrm{ (Markov assumption) } \\ &=& \P(X_r = x \textrm{ for }r\in[0, t] | X_0 = x ) \textrm{ (homogeneity assumption) }\\ &=& \P(T_x > t). \end{eqnarray}

Estimation of CTMCs: fully observed case

Assume first that the full path can be observed. We will relax this assumption later. Suppose also that we observe a single time series, with times holding times $(t_1, t_2, t_3)$ and states $(x_1, x_2, x_3)$. Suppose also that the pmf of the initial distribution is $\mu$.

Bayesian way: one could put gamma priors on the $\lambda$s and Dirichlet priors on the $M$s. This yields a simple conjugate family. (But see Ferreira and Suchard for other, more robust choices of priors based on conditional reference prior, which exhibit better frequentist coverage properties.)

Maximum likelihood: The likelihood for fully observed sequences is:

\begin{eqnarray} \mu(x_1) \left( \lambda(x_1) e^{-t_1 \lambda(x_1)} M(x_1, x_2) \right) \times \left( \lambda(x_2) e^{-t_2 \lambda(x_2)} M(x_2, x_3) \right) \times e^{-t_3 \lambda(x_3)} \end{eqnarray}

Note 1: why is the last factor different?

Note 2: as you saw in the first assignment (and will be discussed in more details today or next lecture), $\lambda$ and $M$ can be reparameterized into a rate matrix $Q = (q_{i,j})$ as follows:

\begin{eqnarray} \lambda(x) &=& -q_{x,x}, \\ M(x, x') &=& \frac{q_{x, x'}}{-q_{x, x}}, \end{eqnarray}

obtaining:

\begin{eqnarray} \mu(x_1) \left( e^{-t_1 q_{x_1}} q_{x_1, x_2} \right) \times \left( e^{-t_2 q_{x_2}} q_{x_2, x_3} \right) \times e^{-t_3 q_{x_3}} \end{eqnarray}

Note 3: Finally, the terms can be grouped, revealing that the total time spent at each state, $T(x)$, and the number of transitions between each pair of states $N(x, x')$, are sufficient statistics:

\begin{eqnarray} \mu(x_1) \prod_x e^{- q_{x,x} T(x)} \prod_{x' \neq x} q_{x, x'}^{N(x, x')}. \end{eqnarray}

Maximizing the log likelihood with the constraint that $- q_{x,x} = \sum_{x' \neq x} q_{x, x'}$ is then straightforward (see Hobolth and Yoshida, 2005).

Estimation of CTMCs: partially observed case

Suppose now that we only observe the end-points $x,y$ of one (or several) time series. Suppose that the time between the observation $t$ is also known. We need to marginalize over the intermediate paths, $\P_q(X_0 = x, X_t = y) = \mu_q(x) \P_q(X_t = y | X_0 = x)$ (maximizing over them leads to a degenerate solution, see Perkins, 2009). Here the $q$ subscript denotes the dependence of the marginal probabilities on the rate matrix $Q$ we are optimizing over.

The main tool we will cover here is the matrix exponential, which allows us to do the marginalization analytically.

Note that once we can compute $\P_q(X_t = y | X_0 = x)$, it is straightforward to design a MH sampler: let $p(q)$ denote a prior density on the entries of $q$. Given a proposal on $q$ with density $\rho$, we accept a new value $q'$ by calculating the ratio:

\begin{eqnarray} \frac{p(q')}{p(q)} \frac{\P_{q'}(X_t = y | X_0 = x)}{\P_q(X_t = y | X_0 = x)} \frac{\rho(q | q')}{\rho(q' | q)}. \end{eqnarray}

Review/background: How to compute the marginal?

Let $(P_t)_{i,j} = \P(X_t = j | X_0 = i)$. What do we know about this matrix valued function $P_t$?

$P_0 = I$
Kolmogorov consistency: $(P_t)_{i,j} = \sum_k (P_t){i,k} (P\t)_{k,j}$

Note: how do we write this in vector notation?

Compare to earlier discussion on memorylessness: Extension: (1) and (2) hold iff $Q$ such that

\begin{eqnarray} P_t &=& \exp(tQ) \\ \exp(M) &=& \sum_{n=0}^\infty \frac{M^n}{n!}. \end{eqnarray}

We will see next lecture $Q$ is actually the same as the $Q$ used in the generative process, just as the name suggests.

Computing the matrix exponential:

Suppose $Q$ an be diagonalized, $A = U D U^{-1}$, $D = \textrm{diag}(\lambda_i)$.

Write:

\begin{eqnarray} \exp(tQ) &=& U U^{-1} + U (tD) U^{-1} + \frac{1}{2!} U(tD) U^{-1} U (tD) U^{-1} + \dots \\ &=& U (I + (tD) + \frac{1}{2!} (tD)^2 + \dots) U^{-1} \\ &=& U \textrm{diag}(\exp(t \lambda_i)) U^{-1} \end{eqnarray}

Supplementary references and notes

Under construction