There is considerable emphasis in the MCMC and SMC literatures on efficiency (in the computational sense of efficiency), but much less on correctness. Here we define correctness as follows: an MCMC procedure producing samples \( X_1, X_2, \dots \) is correct if ergodic averages based any integrable functions \( f \) admit a law of large number converging to the posterior expectation of the function under the target posterior distribution \( \pi \), i.e.
\[\begin{align} \frac{1}{N} \sum_{i=1}^N f(X_i) \to \int f(x) \pi(x) \text{d}x \text{ a.s., } N\to\infty. \end{align}\]
For SMC-based algorithm, we use a similar definition where instead the number of particles is increased to infinity.
Correctness can fail for two main reasons: some mathematical derivations used to construct the MCMC or SMC algorithm are incorrect, or, the software implementation is buggy. The tests we describe here can detect both kinds of problems.
Correctness has nothing to do with what the literature vaguely calls "convergence diagnostic". Whether a LLN holds or not does not depend on "burn-in", having low autocorrelation, etc.
In full generality, testing if a deterministic sequence converges to a certain limit is not possible. On top of that, the sequence we have here is composed of random variables, and moreover these random variables are relatively expensive to simulate in problems of practical interest. Hence our definition of correctness appears hopelessly difficult to check.
Surprisingly, there exists a set of effective tools to check MCMC and SMC correctness. Some of these tools are in the literature but not emphasized. We compile and improve them in this document, and describe their implementation in Blang.
Analysis of MCMC and SMC correctness is a great example of an area where fairly theoretical concepts can have a large impact to a practical problem. Crucially, as we show here, MC theory will allow us to derive useful tests that are not based on asymptotics and that have exact, easy to compute false positive probabilities. Moreover, in some cases, we can even get non-probabilistic tests, i.e. with zero false positive and zero false negative probabilities. In other cases, tests do not have formal guarantees to find all bugs, but in practice they are useful to identify common ones.
We provide a non-standard replacement implementation of bayonet.distributions.Random which can be used to enumerates all the probability traces used by an arbitrary discrete random process. In particular, many inference engines' code manipulate models through interfaces that are agnostic to the model being continuous or discrete, so we can achieve code coverage of the inference engines using discrete models. See bayonet.distributions.ExhaustiveRandom.
We use this for example to test the unbiasness of the normalization constant estimate provided by our SMC implementation.
This can be called with a small finite model, e.g. a short HMM here, but making it is large enough to achieve code coverage (Blang and the BlangIDE are compatible with the eclemma code coverage tool).
The output of the test has the form:
Showing that indeed our implementation of SMC is unbiased.
This can also be used to test numerically that transition probabilities of small state space discrete kernels are indeed invariant with respect to the target (facilities to help automating this will be developed as part of a future release).
Finally, when a model is fully discrete and all generate{..} blocks have the property that for each realization there is at most one execution trace generating it, then we can check that the logf and randomness used in the generate block match by using the arguments:
The basis of this test is from Geweke (2004) (this is the Geweke paper on correctness of MCMC, not to be confused with Geweke's convergence diagnostic, which as mentioned above, is unrelated), but we modify it in an important way.
Given a Blang model,
as in the Geweke paper, we assume it supports two methods for simulating the random variables in the model: one via forward simulation (and since
no data is used in this correctness check, it is reasonable to assume all variables can be filled in this fashion; this will be
true as long appropriate generate
blocks are provided for the constituents); the other, via application of MCMC transition
kernels.
Let \( X \) denote all the variables in the model. Let \( X \sim \pi \) denote the forward simulation process. Let \( X' | X \sim T(\cdot | X) \) denote a combination of kernels including those we want to test for invariance with respect to \( \pi \). Let \( T_i \) denote the individual kernels which are combined into \( T \) via either mixture or composition. Typically, \( T \) is irreducible but not the individual \( T_i \)'s.
Both our test and Geweke's depend on one or several real-valued test functions \( f \), and on comparing two sets of simulations. However these two sets will be defined differently.
In Geweke's method, the two sets are:
The marginal-conditional simulator:
For \( m \in \{1, 2, \dots, M_1\} \)
\( X_m \sim \pi \)
\( F_m = f(X_m) \)
The successive-conditional simulator:
\( X_1 \sim \pi \)
\( G_1 = f(X_1) \)
For \( m \in \{2, 3, \dots, M_2\} \)
\( X_m | X_{m-1} \sim T(\cdot|X_{m-1}) \). Here our definition of \( T \) hides some details in Geweke's method, in particular that one of the \( T_i \)'s re-generate the data given the current parameter values.
\( G_m = f(X_m) \)
Then an approximate test comparing \( \{F_1, F_2, \dots, F_{M_1}\} \) and \( \{G_1, G_2, \dots, G_{M_2}\} \) is derived based on an asymptotic result.
The method has several limitations:
The validity of the approximate test relies on \( T \) being irreducible, which means that individual kernels \( T_i \) cannot be tested in isolation. Therefore, when the test fails, it can be time consuming to determine the root cause.
Whether the \( p \) value exceed a set threshold cannot be known exactly. A leap of faith is needed to assume the asymptotics are sufficiently accurate. Verifying if this is the case can be very difficult in practice. More seriously, the problem is compounded when several such tests need to be combined using a multiple-testing framework. As a consequence Geweke test is not used as an automatic test unit to the best of our knowledge and practitioners typically recommend visual inspection of P-P plots rather than turning Geweke tests into unit tests.
The validity of the approximate test also relies on a CLT for Markov chains to hold, which in turn typically involves establishing geometric ergodicity. Proving geometric ergodicity is model-dependent and quite involved compared to the weaker conditions required for a law of large numbers.
We call our alternative the Exact Invariance Test (EIT), and we use it heavily to establish Blang's SDK correctness. EIT has the following properties:
The test does not rely on irreducibility. This means that individual kernels \( T_i \) can be tested individually, which markedly narrows down the code to be reviewed in the event of bug detection.
EIT does not rely on asymptotic results. It is an exact test.
The test does not rely on geometric ergodicity.
EIT compares the following two sets of samples, \( \{F_1, F_2, \dots, F_{M_1}\} \) and \( \{H_1, H_2, \dots, H_{M_3}\} \):
Those coming from the marginal-conditional simulator, as described above, \( \{F_1, F_2, \dots, F_{M_1}\} \).
The exact invariant simulator:
For \( m \in \{1, 3, \dots, M_3\} \)
\( X_{1,m} \sim \pi \)
For \( k \in \{2, 3, \dots, K\} \)
\( X_{k,m} | X_{k-1,m} \sim T_i(\cdot|X_{k-1,m}) \)
\( H_m = f(X_{K,m}) \)
By construction, for any \( K \ge 1, j \in \{1, \dots, M_1\}, l \in \{1, \dots, M_3\} \), the random variable \( F_j \) is equal in distribution to the random variable \( F_l \) if and only if the kernel \( T_i \) is \( \pi \) invariant. This means that an exact test can be trivially constructed (for example if \( f \) takes on a finite number of values, Fisher's exact test can be used). Alternatively, tests with simple to analyze asymptotics such as the Kolmogorovâ€“Smirnov can be used. Critically, since the \( H_m \)'s are independent, the asymptotics here do not depend on irreducibility or geometric ergodicity, and off-the-shelf iid tests can be used directly. The terminology "Exact" in EIT refers to the random variables being exactly equal in distribution under the null rather than the exactness of the frequentist procedure being used to assess if indeed the two sets are equal in distribution. Note that we have here a rare instance where a point null hypothesis is indeed a well grounded approach, even when using very large values for \( M_1, M_3 \).
Here \( K \ge 1 \) is a parameter controlling the power of test. Exact tests are available for any finite value.
For completeness, we also review below the Cook et al. test.
For \( m \in \{1, 3, \dots, M\} \)
\( \tilde X_{m} \sim \pi \). A subset of the coordinates is held fix for the rest and viewed as synthetic data (i.e. in contrast to Geweke's test, \( T \) does not contain kernels modifying the "data" coordinates of \( X \)).
Set \( X_{1,m} \) to be equal to \( \tilde X_{m} \) for observed coordinates, and to some user-specified initialization value otherwise.
For \( k \in \{2, 3, \dots, K_2\} \)
\( X_{k,m} | X_{k-1,m} \sim T(\cdot|X_{k-1,m}) \)
Compute \( Q_m = \frac{1}{K_2} \sum_{k=1}^{K_2} I[f(\tilde X_{m}) < f(X_{k,m})] \)
Then if the code is correct, the distribution of \( Q_i \) converges to the uniform distribution as \( K_2 \to \infty \), provided the chain is irreducible. Hence this method suffers from the same limitation as Geweke's in terms of not being able to narrow down the error to a single \( T_i \), and being approximate for any finite \( K_2 \).
Prepare the EIS tests by assembling objects of type Instance
, for which the constructor's first argument
is a Blang model, and the following arguments are one or more test functions, playing the role of \( f \) in the
previous section. See for example
the list of
instances used to test Blang's SDK.
Once the instances are available, you can create an automatic test suite by adding a file under the src/test
directory based on the following example:
This takes multiple testing correction under account. The example above also shows a related test, DeterminismTest
which ensure reproducibility, i.e. that using the same random seed leads to exactly the same result.
By virtue of being standard JUnit tests, it is easy to use continuous integration tools such as Travis so that the tests are ran on a remote server each time a commit is made.
For individual univariate continuous distributions, this test checks using numerical integration that the normalization is one.
Having correct normalization is important when we put priors on the parameters and sample the values of the parameters. In such cases the value of the normalization constant plays an often overlooked role in the correctness of transition kernels.