### Inference and runtime: overview

The runtime system is responsible for loading data and performing Bayesian inference based on that data. We seek to make inference correct, general, automated and efficient (prioritized in that order):

1. correctness is approached using a comprehensive suite of tests based on Monte Carlo theory combined with software engineering methodology;

2. generality is provided with an open type system and facilities to quickly develop and test sampling algorithms for new types;

3. automation is based on inference algorithms that remove the need for careful initialization of models as well as an injection framework to ease data loading and conditioning;

4. efficiency is addressed by built-in support for correct and reproducible parallel execution, automatic detection of sparsity structure, and innovative sampling engines based on non-reversible Monte Carlo methods.

The tools for testing and creating new types are described in their own pages. We describe here the inner workings of Blang's automatic, efficient inference machinery in this page. From the user's point of view, this page can be skipped at first reading and is more targeted at developers and Monte Carlo researchers.

### Detection of sparsity patterns

In this section, we describe how a Blang model, say $$m$$ (i.e. the contents of .bl file) gets transformed into an efficient representation aware of $$m$$'s sparsity patterns. This final efficient form is an instance of blang.runtime.SampledModel, a mutable object keeping track of the state space and offering methods to:

• change the annealing parameter of the model, going from prior to posterior (see next section for details);

• apply a transition kernel in place targeting the current annealing parameter;

• perform forward simulation in place;

• get the joint log density of the current configuration;

• duplicating the state via a deep cloning library.

A quick note on mutability: the practice of modifying objects in place is typically avoided as it makes testing more involved. However it is required here for high-performance sampling and our testing framework helps ensure correctness is maintained.

The process of translation of a Blang model into a SampledModel starts with the instantiation of the model variables (which includes executing the default initialization blocks of unspecified variables).

After his is done, the Blang runtime constructs a list $$l$$ of factors. This is done recursively as follows:

• For each blang.core.Model encountered (starting at $$m$$), execute the code generated by the laws block of the model. The generated code is in a method called components(), which returns a collection $$c$$ containing objects of the type blang.core.ModelComponent, an interface encompassing blang.core.Model and blang.core.Factor.

For each item $$i$$ in this collection $$c$$,

• If $$i$$ is of type blang.core.Factor, add $$i$$ to $$l$$. This case usually corresponds to the code generated by a logf(...) { ... } or indicator(...) { ... } block, in both cases the generated factor will be a subtype of the subinterface blang.core.LogScaleFactor. It may also occur when a statement of the form variable is Constrained is encountered, in which case the instantiated object is blang.core.Constrained.

• If $$i$$ is of type blang.core.Model, the present procedure is invoked to obtain a list $$l'$$, the elements of which are all added to the present list $$l$$.

The next phase of initialization consists in building from the instantiated Blang model an accessibility graph, defined as follows. Vertices are taken as the transitive closure of objects (starting at the root model) and of the constituents of these objects. Constituents are fields in the case of objects and integer indices in the case of arrays. Exploration of a field can be skipped in the construction of the accessibility graph using the annotation @SkipDependency. Constitutents can also be customized, as we describe later, for example to index entries of matrices. The (directed) edges of the accessibility graph connect objects to their constituents, and constituents to the object they resolve to, if any. For example, a field might resolve to another object, creating an edge to that object, or to null or a primitive in which case no edge is created. We say that an object $$o_2$$ is accessible from $$o_1$$ if there is a directed path from $$o_1$$ to $$o_2$$.

The latent variables are extracted from the vertex set of the accessibility graph as the intersection of:

• objects of a type annotated with @Samplers (more precisely, object where the class itself, a superclass, or an implemented interface has the annotation @Samplers);

• objects that are mutable, or have an accessible mutable children. Mutability is defined as follows.

• By default, mutability corresponds to accessibility of non-final fields (in Java, fields not marked with the final keyword, in Xtend, fields marked with var) or arrays. This behaviour can be overwritten with the annotation @Immutable;

• To get finer control on mutability, a designated Observations object can be used to mark individual nodes as observed. For example, this can be used to mark an individual entry of a matrix as observed, using observationsObject.markAsObserved(matrix.getRealVar(i, j)). An instance of the Observations object is typically obtained via injection, as described in the input output page. When a node is marked as observed, so is all of its accessible children.

Given an accessibility graph, we are now able to efficiently answer the following questions: given a latent variable $$v$$ and a factor $$f$$, determine if the application of a sampling operator on $$v$$ can change the numerical value of the factor $$f$$.

To determine the answer to this question, we determine if $$v$$ and $$f$$ are co-accessible. Two objects $$o_1$$ and $$o_2$$ are co-accessible if there is a mutable object $$o_3$$ such that $$o_3$$ is accessible from both $$o_1$$ and $$o_2$$.

The total cost of the algorithm we use for finding, for each latent variable, all the co-accessible factors is proportional to the sum over the factors $$f_i$$ of the number of nodes accessible from $$f_i$$. The cost of this pre-processing is therefore expected to be negligible compared to the cost of sampling. See blang.runtime.internals.objectgraph.GraphAnalysis for details.

### Matching transition kernels

After identification of sparsity patterns, samplers are automatically matched to latent variables. To demonstrate how it is done, let us look for example at the declaration of the type Simplex:

... @Samplers(SimplexSampler) class DenseSimplex implements Simplex, DenseMatrix, Delegator<DenseMatrix> { ...

Here the annotation Samplers provides a comma separated list of sampling algorithms. Here is the specific sampler here:

package blang.mcmc import bayonet.distributions.Random import blang.core.Constrained import blang.types.DenseSimplex import blang.core.LogScaleFactor import java.util.List import blang.mcmc.internals.SamplerBuilderContext import blang.mcmc.internals.SimplexWritableVariable class SimplexSampler implements Sampler { @SampledVariable DenseSimplex simplex @ConnectedFactor List<LogScaleFactor> numericFactors @ConnectedFactor Constrained constrained override void execute(Random rand) { val int sampledDim = rand.nextInt(simplex.nEntries) val SimplexWritableVariable sampled = new SimplexWritableVariable(sampledDim, simplex) val RealSliceSampler slicer = RealSliceSampler::build(sampled, numericFactors, 0.0, sampled.sum) slicer.execute(rand) } override boolean setup(SamplerBuilderContext context) { return simplex.nEntries >= 2 && constrained !== null && constrained.object instanceof DenseSimplex } }

The sampler should have exactly one field annotated with @SampledVariable, it will be automatically populated with the variable being sampled.

Then the sampler can declare one or several factors or lists of factors that are supported. These will also be populated automatically by using the graph analysis machinery described above. For example, the list numericFactors in the above example will be populated with all the LogScaleFactor's that are co-accessible with the variable simplex.

Instantiation of the sampler will only be considered if all the co-accessible factors can be matched to some field with the @ConnectedFactor annotation (either directly, or into a list specifying the type of factor via its generic type parameter). This mechanism can be used to disable default samplers. For example, going back to our simplex variable example, naive methods such as one-node Metropolis-Hastings should be avoided. To do so, we use a Constrained variable (see for example in Dirichlet.bl), so that the slice sampler is not instantiated (as the Slice sampler does not declare any annotated field of type Constrained).

### Construction of a sequence of measures and forward generation

For many posterior inference methods, it is useful to have a sequence of measures parameterized by an annealing parameter $$t \in [0, 1]$$ such that $$t = 0$$ corresponds to an easy to sample distribution (e.g. the prior), and $$t = 1$$ corresponds to the distribution of interest. For example, this is necessary to apply parallel tempering or sequential change of measure methods.

Having a sequence of distributions also helps alleviate the initialization problems found in methods purely based on MCMC, which require the sampler to start at a point of positive density. With Blang, we use the intermediate distributions to soften the support restrictions, returning when going out of support a small density going to zero as the annealing parameter approaches one. This does not affect the asymptotic guarantees of our inference method as the restrictions are fully enforced at $$t = 1$$.

One challenge when constructing the sequence of measures is that we have to guarantee that all intermediate measures have a finite normalization constant, otherwise the inference algorithms may silently fail, i.e. the Monte Carlo average may diverge or converge to incorrect values. We use the following strategy to avoid this problem. First, let us view the Blang model as a directed graphical model (we describe in more detail how this is done below). In this representation, the factors correspond to nodes in the graph. We split the factors into two groups according to whether the node is observed or not. Let us denote the product of the factors in each of the two groups by $$\ell(x) = \prod_{i\in I} \ell_i(x)$$ and $$p(x) = \prod_{j\in J} p_j(x)$$ respectively (here $$x$$ encompasses all latent variables). The target posterior distribution satisfies $$\pi(x) \propto p(x)\ell(x)$$. In simple cases, $$\ell(x)$$ corresponds to the likelihood and $$p(x)$$, to the prior; however we avoid this terminology since the formal definition given below generalizes to complex models where there is not necessarily a clear-cut prior and likelihood. We do have that $$\int p(x) \text{d} x = 1$$ in general though. Assuming the problem is well posed, the posterior is also normalizable, $$\int p(x) \ell(x) \text{d} x < \infty$$, and moreover the same is true for subsets of observations, i.e. for $$K \subset I$$, $$\int p(x) \prod_{i\in K} \ell_i(x) \text{d} x < \infty$$.

Given that notation, the sequence of intermediate measures we use is given by \begin{align} \pi_t(x) = p(x) \prod_{i\in I}[(\ell_i(x))^t + I(\ell_i(x) = 0) \epsilon_t], \end{align}where $$\epsilon_t \le 1$$ is a decreasing sequence such $$\epsilon_1 = 0$$ and $$I(\cdot)$$ denotes an indicator. Concretely, we use $$\epsilon_t = \exp(-t 1\text{e}100) I(t < 1)$$.

Measures in this sequences are guaranteed to have finite normalization since: \begin{align} \int \pi_t(x) \text{d} x &= \int p(x) \prod_{i\in I}[ (\ell_i(x))^t + I(\ell_i(x) = 0) \epsilon_t ] \text{d} x\\ &\le \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} \int p(x) (\prod_{i \in K}\ell_i(x))^t \text{d} x \\ &= \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} \int p(x) (\prod_{i \in K}\ell_i(x))^t [I(\prod_{i \in K}\ell_i(x) \ge 1) + I(\prod_{i \in K}\ell_i(x) < 1)] \text{d} x \\ &\le \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} [\int p(x) \prod_{i \in K}\ell_i(x) \text{d} x + \int p(x) \text{d} x] \\ &< \infty. \end{align}

The process by which factors are segregated between the "prior" set $$\{p_j : j\in J\}$$ and the "likelihood" set $$\{\ell_i : i\in I\}$$ proceeds recursively from the root model:

• If the model has a generate block:

• If all the model's random variables are observed (not latent), put all the numerical factors defined by this model in the set $$\{\ell_i : i\in I\}$$.

• If all the model's random variables are latent, put all the numerical factors defined by this model in the set $$\{p_j : j\in J\}$$.

• If the current model's random variables are partially observed, throw an exception.

• If the model does not have a generate block

• Recurse over all components of the model. Assume they are all themselves models (i.e. if a model does not have generate block, all items in the laws block should be based on the ... | ... ~ ... construct rather than the logf and indicator ones.

We point out that this decomposition is not possible for all models. In such cases, it is often possible to rewrite the models to follow the rules outlined above. If not, the user can also back off to inference methods that do not require a sequence of measures. This can be done with the command line arguments (also setting the inference engine to Parallel Tempering with a single chain (PT):

--checkIsDAG false --engine.usePriorSamples false --stripped true --engine PT --engine.nChains 1

### Inference algorithms

Several inference algorithms are available built-in. They can be configured by providing various arguments to the main application. To get a full list, you can always use the switch --help. Grep engine to find the subset related to inference algorithms. The selection is controlled with the switch --engine:

• SCM: Sequential Change of Measure. This pushes a population of particles from the distribution $$\pi_0$$ into $$\pi = \pi_1$$. In summary:

• First, the forward simulation machinery is used to get exact samples from $$\pi_0$$. Several particles are hence created. The number of particles is controlled with --engine.nParticles and controls the quality of the approximation.

• The annealing parameter ($$t$$ indexing $$\pi_t$$) is then increased sequentially from zero to one. At each iteration, each particle is perturbed by a transition kernel invariant with respect to the next annealed target and reweighted to take into account the annealing parameter increase. The amount of increase at each iteration is controlled by --engine.temperatureSchedule, and the default value, AdaptiveTemperatureSchedule, is based on an adaptive strategy, see Zhou, Johansen and Aston (2013).

• When the relative Effective Sample Size (ESS) falls under a threshold (specified by the command line argument --engine.resamplingESSThreshold), resampling is performed. The Annealed Importance Sampling algorithm can be recovered as special case by setting --engine.resamplingESSThreshold 0.0.

• After obtaining particles approximately distributed according to $$\pi_1 = \pi$$ the quality of the approximation can be optionally increased by performing rejuvenation steps on each particles, i.e. scans where all transition kernels targeting $$\pi$$ are applied in a random order. The number of such scans is set by --engine.nFinalRejuvenations. Set it to zero to disable rejuvenation (for example if you only care about the estimate of the probability of the data).

• The options --engine.nThreads controls the number of parallel threads, by default as many threads as half the number of idle cores are used (--engine.nThreads Dynamic). Note that the result of the computation is identical no matter how many threads are used (this is achieved with the use of separate random stream indexed by particles).

• PT: Parallel Tempering. This creates parallel MCMC chains, each targeting different annealing parameters.

• The number of different temperatures can be controlled with --engine.nChains (in particular, setting it to 1 will reduce the algorithm to a standard MCMC algorithm.

• If the option --engine.usePriorSamples is set to true (the default), then the chain at parameter zero is obtained by sampling iid from the prior distribution.

• We define a scan as doing moves on all temperatures followed by attempting all either odd or even swaps, followed by statistics collection. Use --engine.nScans to control the number of scans. Use --engine.nPassesPerScan to control how many passes to perform in each chain on each scan, where we define a pass as applying each MCMC kernel once.

• The option --engine.ladder controls the scheme determining the sequence of annealing parameters to use. Those available by default so far are Geometric and EquallySpaced. Use for example --engine.ladder Geometric --help for more details on Geometric and similarly for other schemes. To implement a custom scheme, create a class implementing blang.engines.internals.ladders.TemperatureLadder and use the fully qualified name of the form --engine.ladder mypackage.MyLadder.

• The options --engine.nThreads controls the number of parallel threads, by default as many threads as half the number of idle cores are used (--engine.nThreads Dynamic). Note that the result of the computation is identical no matter how many threads are used (this is achieved with the use of separate random stream indexed by chains).

• Forward: Use the forward simulation machinery for the subset of the states that are not observed.

• Exact: For models that are fully discrete, this enumerates all the configurations. The current algorithm does not attempt to exploit conditional independence assumptions and has exponential computational complexity. It is still useful for debugging and pedagogy.

• To use a custom inference scheme, implement the interface PosteriorInferenceEngine, and input the fully qualified name of the implementation as argument.

How the object graph is explored is controlled in blang.runtime.internals.objectgraph.ExplorationRules. To customize, one would change the list of rules in the static field defaultExplorationRules via static initialization.