Lecture 7: Related topics in computational statistics

27 Jan 2014

Instructor: Alexandre Bouchard-Côté

Editor: Vincent Zhai and Tingting Zhao

Related topics in computational statistics

By now, you should all be familiar with hard inference problems of the form \begin{eqnarray}\label{eq:problem} \bar \pi(x) = \frac{\pi(x)}{Z}, \end{eqnarray} where the goal is to compute an expectation with respect to the density $\bar \pi$ and/or estimate $Z = \normof{\pi}$. For simplicity, for the first part of today we will assume the space is discrete, so you can think of $\bar \pi$ as a p.m.f.

Our running example will be the posterior of seatings under a CRP (Chinese restaurant process).

We will call $\phi$ the function whose expectation we want to compute:

\begin{eqnarray} \bar \pi \phi = \sum_x \phi(x) \bar \pi(x). \end{eqnarray}

Example of $\phi$?

\begin{eqnarray} \phi(x) = \mathbf{1}(\text{$i$ and $j$ are sitting together in $x$}) \end{eqnarray}

MCMC is one solution but suffers from some problems:

  • Fundamentally sequential, making it harder to adapt to parallel architectures (GPUs, multi-processors, clusters)
  • Reversible moves can be difficult to construct in some situations (e.g. requiring reversible jumps)
  • The normalization constant $Z$ is non-trivial to estimate from the MCMC output
  • Non-trivial to analyze

Today we will discuss other approaches to Equation~(\ref{eq:problem}):

  • Sequential Monte Carlo (SMC)
    • Particle Markov chain Monte Carlo (PMCMC)
    • Generalized view (if time permits)
  • Exact propagation of messages
  • Hamiltonian Monte Carlo (HMC, if time permits)

Sequential Monte Carlo (SMC)

Background: The foundation of SMC is importance sampling (IS). IS is a way of building a random measure $\pi^{(K)}$ such that:

  • $\pi^{(K)}$ approximates $\pi$,
  • it is easy to compute expectations and normalization of $\pi^{(K)}$,
  • these two approximations converge to their true values as $K \to \infty$.

IS works by drawing $K$ samples $X^{(k)}$ according to a proposal with density $q$, and building a random measure with atoms at the $X^{(k)}$s:

\begin{eqnarray} \pi^{(K)} = \frac{1}{K} \sum_k w^{(k)} \delta_{X^{(k)}} \end{eqnarray}

where the random weights are given by $w^{(k)} = w(X^{(k)})$, with:

\begin{eqnarray} w(x) = \frac{\pi(x)}{q(x)} \end{eqnarray}

Note:

  1. the normalization $\normof{\pi^{(K)}}$ of this approximation is easy to compute (what is it?)
  2. expectations with respect to $\bar \pi^{(K)}$ are easy to compute (what do they look like?)

As long as the support of $q$ is no smaller than the support of $\pi$, we have convergence almost surely (a.s.) of the estimates 1 and 2 above to their targets $Z$ and $\bar \pi \phi$. The key to proving this claim (exercise) is a change of measure:

\begin{eqnarray} Z &=& \sum_x \pi(x) \\ &=& \sum_x \left(\frac{\pi(x)}{q(x)}\right) q(x), \end{eqnarray}

which can be interpreted as $\E[w(X)]$ if $X \sim q$. By the LLN, the approximation

\begin{eqnarray} \frac{1}{K} \sum_k w\left(X^{(k)}\right) \end{eqnarray}

will therefore converge to $Z$ if the $X^{(k)}$ are iid samples from $q$.
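
To make notes 1 and 2 above concrete, here is a minimal sketch in Python on a toy discrete target (made up purely for illustration; it is not the CRP example):

import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized target pi on {0, ..., 9} (illustrative only).
support = np.arange(10)
pi_unnorm = np.exp(-0.5 * (support - 3.0) ** 2)   # pi(x), unnormalized
q_probs = np.full(10, 0.1)                        # proposal q: uniform on the support

K = 100_000
X = rng.choice(support, size=K, p=q_probs)        # X^(k) iid from q
w = pi_unnorm[X] / q_probs[X]                     # weights w(X^(k)) = pi(X^(k)) / q(X^(k))

Z_hat = w.mean()                                  # note 1: (1/K) sum_k w^(k) estimates Z
phi = (X >= 5).astype(float)                      # an example test function phi
phi_hat = np.sum(w * phi) / np.sum(w)             # note 2: sum_k wbar^(k) phi(X^(k)) estimates bar-pi(phi)

print(Z_hat, pi_unnorm.sum())                                    # estimate vs exact Z
print(phi_hat, pi_unnorm[support >= 5].sum() / pi_unnorm.sum())  # estimate vs exact expectation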

Sequential Importance Sampling (SIS): In the CRP case, it is more natural to write $q$ as a product of conditionals:

\begin{eqnarray} q(x) = \prod_n q(x_{1:n} | x_{1:n-1}), \end{eqnarray}

where $x_{1:n}$ is the seating arrangement of customers 1 through $n$.

We can modify IS to exploit this structure. This is useful, for example, in an online setting, where new datapoints come in dynamically.

To do this, we need to consider not only one target distribution $\pi$, but instead several target distributions given by

\begin{eqnarray} \bar \pi_n = \frac{\pi_n}{Z_n}, \end{eqnarray}

where $\bar \pi_n$ can be interpreted as the posterior distribution on seating arrangement given datapoints 1 through $n$.

With this in hand, we now simply use a telescoping product to rewrite the weight as a sequence of weight updates:

\begin{eqnarray} \pi_n &=& \left( \frac{\pi_1}{\pi_0} \right) \left( \frac{\pi_2}{\pi_1} \right) \dots \left( \frac{\pi_n}{\pi_{n-1}} \right) \\ w_n(x_{1:n}) &=& \frac{\pi_n(x_{1:n})}{\pi_{n-1}(x_{1:n-1})} \frac{1}{q(x_{1:n}|x_{1:n-1})} \\ w &=& \prod_n w_n \end{eqnarray}

where $\pi_0 \equiv 1$.
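
As a sanity check, multiplying the incremental weights up to step $n$ recovers the plain IS weight for $\pi_n$ (using $\pi_0 \equiv 1$ and the factorization of $q$ above):

\begin{eqnarray} \prod_{m=1}^{n} w_m(x_{1:m}) = \left( \prod_{m=1}^{n} \frac{\pi_m(x_{1:m})}{\pi_{m-1}(x_{1:m-1})} \right) \left( \prod_{m=1}^{n} \frac{1}{q(x_{1:m}|x_{1:m-1})} \right) = \frac{\pi_n(x_{1:n})}{q(x_{1:n})}. \end{eqnarray}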


Example: weight update for the CRP posterior.

A simple proposal consists in using the CRP prior probabilities, $q(x_{1:n}|x_{1:n-1}) = \crp(\part(x_{1:n})|\part(x_{1:n-1}))$. If we let $\part = \part(x_{1:n-1})$ and $\part' = \part(x_{1:n})$, we get:

\begin{eqnarray} w_n(x_{1:n}) &=& \frac{\crp(\part') \left( \prod_{B' \in \part'} m(y_{B'}) \right)}{\crp(\part) \left( \prod_{B \in \part} m(y_B) \right)} \frac{1}{\crp(\part(x_{1:n})|\part(x_{1:n-1}))} \\ &=& \frac{m(y_{B'_0})}{m(y_{B_0})}. \end{eqnarray}

Here $B_0$ and $B'_0$ denote the block that the new customer joins, before and after insertion respectively (with $B_0 = \emptyset$ if the customer starts a new table). Since $B'_0 = B_0 \cup \{n\}$, we get: $w_n(x_{1:n}) = \ell(y_n | y_{B_0})$.
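
As a rough sketch of one SIS propagation step for a single particle (the CRP prior is used as proposal; predictive_loglik is a hypothetical stand-in for $\log \ell(y_n | y_{B_0})$, whose form depends on the chosen likelihood and base measure):

import numpy as np

def sis_step_crp(tables, y_n, alpha0, predictive_loglik, rng):
    """Seat one new customer using the CRP prior as proposal, in place,
    and return the incremental log-weight log w_n for this particle.
    tables[j] is the list of datapoints currently at table j."""
    n_seated = sum(len(t) for t in tables)
    # CRP prior: join table j w.p. |B_j| / (n_seated + alpha0),
    # open a new table w.p. alpha0 / (n_seated + alpha0).
    probs = np.array([len(t) for t in tables] + [alpha0], dtype=float)
    probs /= n_seated + alpha0
    choice = rng.choice(len(probs), p=probs)
    block = tables[choice] if choice < len(tables) else []
    # With the prior as proposal, the prior terms cancel and the incremental
    # weight reduces to the predictive likelihood of y_n given its table.
    log_w_n = predictive_loglik(y_n, block)
    if choice < len(tables):
        tables[choice].append(y_n)
    else:
        tables.append([y_n])
    return log_w_n

A full SIS sweep applies this step to each of the $K$ particles for $n = 1, \dots, N$, accumulating the log-weights.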


Another view of this algorithm is that we have a sequence of approximating measures $\pi^{(K)}_n$, where approximation $\pi^{(K)}_n$ is obtained recursively from $\pi^{(K)}_{n-1}$ by randomly propagating each of the atoms via $q(\cdot|\cdot)$ and updating the weights via $w_n$. Since we have just rewritten the algorithm in an equivalent form, it clearly keeps the asymptotic properties (in $K$) from the last section.

One can view this as mass propagation/transportation on a tree organized by layers.

Resampling: Since we are often less interested in the previous measures $\pi_{1}, \pi_{2}, \dots$ and more interested in the latest one $\pi_{n}$, it is useful to trade a loss of information in the far past for a more accurate approximation near the present iteration. This is done via resampling.

The goal of resampling is to discard the particles of low weight while preserving the asymptotic guarantees outlined in the previous section.

In its simplest form (multinomial resampling), it consists of two steps:

  1. Normalize the weights $w^{(1)}_n, \dots, w^{(K)}_n$ into $\bar w^{(1)}_n, \dots, \bar w^{(K)}_n$
  2. Resample $K$ times from the categorical distribution with parameters $\bar w^{(1)}_n, \dots, \bar w^{(K)}_n$.

Let $\tilde x^{(1)}, \dots, \tilde x^{(K)}$ denote the list of particles sampled from step 2 above (note that there can be repeats). We construct the following new approximation from the $\tilde x^{(k)}$:

\begin{eqnarray} \tilde \pi^{(K)}_n = \sum_k \frac{1}{K} \delta_{\tilde x^{(k)}}. \end{eqnarray}

Note that the weights have all been reset to $1/K$. Informally, this is because the random number of repeats of each particle now encodes the weights $\bar w^{(k)}_n$. Formally, it is a good exercise to show that for all fixed $A$,

\begin{eqnarray} \E\left[\tilde \pi^{(K)}_n(A) | \pi^{(K)}_n\right] = \bar \pi^{(K)}_n(A). \end{eqnarray}

Using this and Minkowski's inequality, it is not too hard to show that resampling preserves the asymptotic guarantees as well (hint: prove convergence in $L^2$). In particular, the normalization can now be consistently approximated by:

\begin{eqnarray}\label{eq:marg-smc} Z^{(K)} = \prod_n \frac{1}{K}\sum_k w_n^{(k)}. \end{eqnarray}
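
A minimal sketch of multinomial resampling and of the running estimate in Equation~(\ref{eq:marg-smc}); propagate_and_weight is a placeholder that extends a particle by one customer in place and returns its incremental log-weight (e.g. a wrapper around sis_step_crp above), and log-weights are used throughout for numerical stability:

import copy
import numpy as np

def multinomial_resample(particles, log_w, rng):
    """Resample K particles in proportion to their normalized weights;
    the returned particles all carry weight 1/K."""
    K = len(particles)
    w = np.exp(log_w - np.max(log_w))        # stabilize before normalizing
    probs = w / w.sum()
    idx = rng.choice(K, size=K, p=probs)
    # Deep copies so that duplicated particles can later evolve independently.
    return [copy.deepcopy(particles[i]) for i in idx]

def smc(init_particles, N, propagate_and_weight, rng):
    """SMC with multinomial resampling at every generation; returns the final
    particles and the log of Equation (marg-smc), log prod_n (1/K) sum_k w_n^(k)."""
    particles = init_particles
    logZ = 0.0
    for n in range(1, N + 1):
        log_w = np.array([propagate_and_weight(p, n, rng) for p in particles])
        m = np.max(log_w)
        logZ += m + np.log(np.mean(np.exp(log_w - m)))   # log of (1/K) sum_k w_n^(k)
        particles = multinomial_resample(particles, log_w, rng)
    return particles, logZ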

Particle MCMC (PMCMC)

Suppose we want to put a prior on the CRP concentration parameter $\alpha_0$ and/or on the likelihood hyper-parameters $h$. Different values can induce potentially very different clusterings (especially for $h$). How can we change our SMC algorithm to handle such global variables $\theta = (\alpha_0, h)$? This is where PMCMC comes in.

PMCMC methods are essentially MCMC algorithms in which SMC acts as an MH proposal. There is therefore an MCMC algorithm in the outer loop, indexed by $j$, and an SMC algorithm in the inner loop, with particle weights denoted $w^{(j,k)}_n$.

Simplest PMCMC algorithm: PMMH (particle marginal Metropolis-Hastings)

Suppose first we only care about getting a posterior over $\theta$. Let $p(\theta)$ denote its prior density. The obstacle is that $Z_{\theta} = p(y | \theta)$, where $y$ is the data, is difficult to compute (recall that it would involve summing over all partitions $\part$ of the $n$ datapoints). On the other hand, for any fixed value of $\theta$, we can get an approximation of $Z_{\theta} = p(y | \theta)$ by running the SMC algorithm of the previous section and returning Equation~(\ref{eq:marg-smc}).

Therefore, PMMH stores two quantities at each MCMC iteration $j$:

  1. A current value of the global variable $\theta^{(j)}$.
  2. An SMC estimate $Z^{(j)}$ corresponding to $\theta^{(j)}$, i.e. an approximation of the marginal density of the data given $\theta^{(j)}$.

At each MCMC iteration $j$, PMMH takes these steps (a code sketch follows the list):

  1. It proposes a new value for the global variable, $\theta'$, using a proposal $q(\theta'|\theta)$ (note the abuse of notation, $q$ is also used for the very different proposal used within the SMC inner loop).
  2. It computes an approximation $Z'$ to the marginal density of the data given $\theta'$ using Equation~(\ref{eq:marg-smc}).
  3. It forms the ratio: \begin{eqnarray} r = \frac{p(\theta')}{p(\theta)} \frac{Z'}{Z^{(j-1)}} \frac{q(\theta|\theta')}{q(\theta'|\theta)} \end{eqnarray}
  4. Sample a Bernoulli$(\min(1,r))$,
    • If accepted, set $(\theta^{(j)}, Z^{(j)}) \gets (\theta', Z')$
    • Else, set $(\theta^{(j)}, Z^{(j)}) \gets (\theta^{(j-1)}, Z^{(j-1)})$
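
The steps above can be sketched as follows; all function arguments are hypothetical placeholders (log_prior for $\log p(\theta)$, propose and log_q_ratio for the outer-loop proposal, and smc_logZ returning the log of Equation~(\ref{eq:marg-smc}) for a given $\theta$, e.g. via the smc sketch above):

import numpy as np

def pmmh_step(theta, logZ, log_prior, propose, log_q_ratio, smc_logZ, rng):
    """One PMMH iteration: propose theta', estimate log p(y | theta') by SMC,
    and accept or reject with the MH ratio in which the intractable marginal
    likelihoods are replaced by their SMC estimates."""
    theta_prop = propose(theta, rng)            # theta' ~ q(. | theta)
    logZ_prop = smc_logZ(theta_prop, rng)       # log of Eq. (marg-smc) at theta'
    log_r = (log_prior(theta_prop) - log_prior(theta)
             + logZ_prop - logZ
             + log_q_ratio(theta, theta_prop))  # log q(theta|theta') - log q(theta'|theta)
    if np.log(rng.uniform()) < log_r:
        return theta_prop, logZ_prop            # accept
    return theta, logZ                          # reject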

Analysis:

For any fixed number of particles $K$, there will be an error in the approximation $Z^{(j)}$ of $Z_{\theta^{(j)}}$. It follows that the MH ratio will be different from the idealized ratio, \begin{eqnarray} r^\star = \frac{p(\theta')}{p(\theta)} \frac{p(y|\theta')}{p(y|\theta)} \frac{q(\theta|\theta')}{q(\theta'|\theta)}, \end{eqnarray} and it is therefore not obvious a priori what the asymptotic behavior of PMMH is if we let the number of MCMC iterations go to infinity for a fixed $K$.

It was therefore a surprise when Andrieu, Doucet and Holenstein, 2010 showed that even for fixed $K$ this converges to the right target, $p(\theta | y)$.

The basic idea is to introduce auxiliary variables modelling each step of the SMC inner loop, namely:

  • $K(N-1)$ discrete variables $d$ on $\{1, \dots, K\}$ representing all the indices from the resampling steps for customers $2, 3, \dots, N$.
  • $KN$ variables representing the partial seating arrangements $s$ stored by the $K$ particles for customers $1, 2, 3, \dots, N$.
  • One extra random variable $t$ uniformly distributed on the $K^N$ resampling trajectories $\{1, 2, \dots, K\} \times \dots \times \{1, 2, \dots, K\}$.

The auxiliary distribution we put on these random variables can be described with the following generative process:

  1. pick a value for the trajectory random variable $t$. This determines $N-1$ of the resampling index variables $d$ at the same time.
  2. fill in the $N$ variables $s$ corresponding to the sampled trajectory according to the target (intractable) distribution $p(z|\theta)$.
  3. sample the rest of the variables $s,d$ according to the standard SMC algorithm (with the difference that the sampled trajectory is forced to be resampled at least once at each generation). Call $r$ this subset of $s,d$.

Appendix B1 of Andrieu, Doucet and Holenstein, 2010 shows that PMMH can be interpreted as a standard MH algorithm on that expanded space.

This augmented construction also provides a second PMCMC algorithm called Particle Gibbs (PG). It corresponds to doing block sampling on the augmented space, with the sampling steps divided as follows:

  • $t | \textrm{rest}$
  • $r | \textrm{rest}$
  • $\theta | \textrm{rest}$.

Again, see Andrieu, Doucet and Holenstein, 2010 for details.

Supplementary references and notes