Blang

Inference and runtime: overview

The runtime system is responsible for loading data and performing Bayesian inference based on that data. We seek to make inference correct, general, automated and efficient (prioritized in that order):

  1. correctness is approached using a comprehensive suite of tests based on Monte Carlo theory combined with software engineering methodology;

  2. generality is provided with an open type system and facilities to quickly develop and test sampling algorithms for new types;

  3. automation is based on inference algorithms that remove the need for careful initialization of models as well as an injection framework to ease data loading and conditioning;

  4. efficiency is addressed by built-in support for correct and reproducible parallel execution, automatic detection of sparsity structure, and innovative sampling engines based on non-reversible Monte Carlo methods.

The tools for testing and creating new types are described in their own pages. We describe here the inner workings of Blang's automatic, efficient inference machinery in this page. From the user's point of view, this page can be skipped at first reading and is more targeted at developers and Monte Carlo researchers.

Detection of sparsity patterns

In this section, we describe how a Blang model, say \( m \) (i.e. the contents of .bl file) gets transformed into an efficient representation aware of \( m \)'s sparsity patterns. This final efficient form is an instance of blang.runtime.SampledModel, a mutable object keeping track of the state space and offering methods to:

A quick note on mutability: the practice of modifying objects in place is typically avoided as it makes testing more involved. However it is required here for high-performance sampling and our testing framework helps ensure correctness is maintained.

The process of translation of a Blang model into a SampledModel starts with the instantiation of the model variables (which includes executing the default initialization blocks of unspecified variables).

After his is done, the Blang runtime constructs a list \( l \) of factors. This is done recursively as follows:

The next phase of initialization consists in building from the instantiated Blang model an accessibility graph, defined as follows. Vertices are taken as the transitive closure of objects (starting at the root model) and of the constituents of these objects. Constituents are fields in the case of objects and integer indices in the case of arrays. Exploration of a field can be skipped in the construction of the accessibility graph using the annotation @SkipDependency. Constitutents can also be customized, as we describe later, for example to index entries of matrices. The (directed) edges of the accessibility graph connect objects to their constituents, and constituents to the object they resolve to, if any. For example, a field might resolve to another object, creating an edge to that object, or to null or a primitive in which case no edge is created. We say that an object \( o_2 \) is accessible from \( o_1 \) if there is a directed path from \( o_1 \) to \( o_2 \).

The latent variables are extracted from the vertex set of the accessibility graph as the intersection of:

Given an accessibility graph, we are now able to efficiently answer the following questions: given a latent variable \( v \) and a factor \( f \), determine if the application of a sampling operator on \( v \) can change the numerical value of the factor \( f \).

To determine the answer to this question, we determine if \( v \) and \( f \) are co-accessible. Two objects \( o_1 \) and \( o_2 \) are co-accessible if there is a mutable object \( o_3 \) such that \( o_3 \) is accessible from both \( o_1 \) and \( o_2 \).

The total cost of the algorithm we use for finding, for each latent variable, all the co-accessible factors is proportional to the sum over the factors \( f_i \) of the number of nodes accessible from \( f_i \). The cost of this pre-processing is therefore expected to be negligible compared to the cost of sampling. See blang.runtime.internals.objectgraph.GraphAnalysis for details.

Matching transition kernels

After identification of sparsity patterns, samplers are automatically matched to latent variables. To demonstrate how it is done, let us look for example at the declaration of the type Simplex:

... @Samplers(SimplexSampler) class DenseSimplex implements Simplex, DenseMatrix, Delegator<DenseMatrix> { ...

Here the annotation Samplers provides a comma separated list of sampling algorithms. Here is the specific sampler here:

package blang.mcmc import bayonet.distributions.Random import blang.core.Constrained import blang.types.DenseSimplex import blang.core.LogScaleFactor import java.util.List import blang.mcmc.internals.SamplerBuilderContext import blang.mcmc.internals.SimplexWritableVariable class SimplexSampler implements Sampler { @SampledVariable DenseSimplex simplex @ConnectedFactor List<LogScaleFactor> numericFactors @ConnectedFactor Constrained constrained override void execute(Random rand) { val int sampledDim = rand.nextInt(simplex.nEntries) val SimplexWritableVariable sampled = new SimplexWritableVariable(sampledDim, simplex) val RealSliceSampler slicer = RealSliceSampler::build(sampled, numericFactors, 0.0, sampled.sum) slicer.execute(rand) } override boolean setup(SamplerBuilderContext context) { return simplex.nEntries >= 2 && constrained !== null && constrained.object instanceof DenseSimplex } }

The sampler should have exactly one field annotated with @SampledVariable, it will be automatically populated with the variable being sampled.

Then the sampler can declare one or several factors or lists of factors that are supported. These will also be populated automatically by using the graph analysis machinery described above. For example, the list numericFactors in the above example will be populated with all the LogScaleFactor's that are co-accessible with the variable simplex.

Instantiation of the sampler will only be considered if all the co-accessible factors can be matched to some field with the @ConnectedFactor annotation (either directly, or into a list specifying the type of factor via its generic type parameter). This mechanism can be used to disable default samplers. For example, going back to our simplex variable example, naive methods such as one-node Metropolis-Hastings should be avoided. To do so, we use a Constrained variable (see for example in Dirichlet.bl), so that the slice sampler is not instantiated (as the Slice sampler does not declare any annotated field of type Constrained).

Construction of a sequence of measures and forward generation

For many posterior inference methods, it is useful to have a sequence of measures parameterized by an annealing parameter \( t \in [0, 1] \) such that \( t = 0 \) corresponds to an easy to sample distribution (e.g. the prior), and \( t = 1 \) corresponds to the distribution of interest. For example, this is necessary to apply parallel tempering or sequential change of measure methods.

Having a sequence of distributions also helps alleviate the initialization problems found in methods purely based on MCMC, which require the sampler to start at a point of positive density. With Blang, we use the intermediate distributions to soften the support restrictions, returning when going out of support a small density going to zero as the annealing parameter approaches one. This does not affect the asymptotic guarantees of our inference method as the restrictions are fully enforced at \( t = 1 \).

One challenge when constructing the sequence of measures is that we have to guarantee that all intermediate measures have a finite normalization constant, otherwise the inference algorithms may silently fail, i.e. the Monte Carlo average may diverge or converge to incorrect values. We use the following strategy to avoid this problem. First, let us view the Blang model as a directed graphical model (we describe in more detail how this is done below). In this representation, the factors correspond to nodes in the graph. We split the factors into two groups according to whether the node is observed or not. Let us denote the product of the factors in each of the two groups by \( \ell(x) = \prod_{i\in I} \ell_i(x) \) and \( p(x) = \prod_{j\in J} p_j(x) \) respectively (here \( x \) encompasses all latent variables). The target posterior distribution satisfies \( \pi(x) \propto p(x)\ell(x) \). In simple cases, \( \ell(x) \) corresponds to the likelihood and \( p(x) \), to the prior; however we avoid this terminology since the formal definition given below generalizes to complex models where there is not necessarily a clear-cut prior and likelihood. We do have that \( \int p(x) \text{d} x = 1 \) in general though. Assuming the problem is well posed, the posterior is also normalizable, \( \int p(x) \ell(x) \text{d} x < \infty \), and moreover the same is true for subsets of observations, i.e. for \( K \subset I \), \( \int p(x) \prod_{i\in K} \ell_i(x) \text{d} x < \infty \).

Given that notation, the sequence of intermediate measures we use is given by \[\begin{align} \pi_t(x) = p(x) \prod_{i\in I}[(\ell_i(x))^t + I(\ell_i(x) = 0) \epsilon_t], \end{align}\]where \( \epsilon_t \le 1 \) is a decreasing sequence such \( \epsilon_1 = 0 \) and \( I(\cdot) \) denotes an indicator. Concretely, we use \( \epsilon_t = \exp(-t 1\text{e}100) I(t < 1) \).

Measures in this sequences are guaranteed to have finite normalization since: \[\begin{align} \int \pi_t(x) \text{d} x &= \int p(x) \prod_{i\in I}[ (\ell_i(x))^t + I(\ell_i(x) = 0) \epsilon_t ] \text{d} x\\ &\le \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} \int p(x) (\prod_{i \in K}\ell_i(x))^t \text{d} x \\ &= \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} \int p(x) (\prod_{i \in K}\ell_i(x))^t [I(\prod_{i \in K}\ell_i(x) \ge 1) + I(\prod_{i \in K}\ell_i(x) < 1)] \text{d} x \\ &\le \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} [\int p(x) \prod_{i \in K}\ell_i(x) \text{d} x + \int p(x) \text{d} x] \\ &< \infty. \end{align}\]

The process by which factors are segregated between the "prior" set \( \{p_j : j\in J\} \) and the "likelihood" set \( \{\ell_i : i\in I\} \) proceeds recursively from the root model:

We point out that this decomposition is not possible for all models. In such cases, it is often possible to rewrite the models to follow the rules outlined above. If not, the user can also back off to inference methods that do not require a sequence of measures. This can be done with the command line arguments (also setting the inference engine to Parallel Tempering with a single chain (PT):

--checkIsDAG false --engine.usePriorSamples false --stripped true --engine PT --engine.nChains 1 --engine.initialization COPIES

Inference algorithms

Several inference algorithms are available built-in. They can be configured by providing various arguments to the main application. To get a full list, you can always use the switch --help. Grep engine to find the subset related to inference algorithms. The selection is controlled with the switch --engine:

Advanced graph analysis

The construction of the object graph (used for determining sparsity patterns) can be customized. The main use case for such customization is to construct views into larger objects, e.g. slices of matrices.

How the object graph is explored is controlled in blang.runtime.internals.objectgraph.ExplorationRules. To customize, one would change the list of rules in the static field defaultExplorationRules via static initialization.