The runtime system is responsible for loading data and performing Bayesian inference based on that data. We seek to make inference correct, general, automated and efficient (prioritized in that order):
correctness is approached using a comprehensive suite of tests based on Monte Carlo theory combined with software engineering methodology;
generality is provided with an open type system and facilities to quickly develop and test sampling algorithms for new types;
automation is based on inference algorithms that remove the need for careful initialization of models as well as an injection framework to ease data loading and conditioning;
efficiency is addressed by built-in support for correct and reproducible parallel execution, automatic detection of sparsity structure, and innovative sampling engines based on non-reversible Monte Carlo methods.
The tools for testing and creating new types are described in their own pages. We describe here the inner workings of Blang's automatic, efficient inference machinery in this page. From the user's point of view, this page can be skipped at first reading and is more targeted at developers and Monte Carlo researchers.
In this section, we describe how a Blang model, say \( m \) (i.e. the contents of .bl
file) gets transformed into an efficient representation aware of \( m \)'s sparsity patterns.
This final efficient form is an instance of blang.runtime.SampledModel
, a mutable object
keeping track of the state space and offering methods to:
change the annealing parameter of the model, going from prior to posterior (see next section for details);
apply a transition kernel in place targeting the current annealing parameter;
perform forward simulation in place;
get the joint log density of the current configuration;
duplicating the state via a deep cloning library.
A quick note on mutability: the practice of modifying objects in place is typically avoided as it makes testing more involved. However it is required here for high-performance sampling and our testing framework helps ensure correctness is maintained.
The process of translation of a Blang model into a SampledModel starts with the instantiation of the model variables (which includes executing the default initialization blocks of unspecified variables).
After his is done, the Blang runtime constructs a list \( l \) of factors. This is done recursively as follows:
For each blang.core.Model
encountered (starting at \( m \)),
execute the code generated by the
laws block of the model.
The generated code is in a method called components()
, which returns a collection
\( c \) containing objects of the type
blang.core.ModelComponent
, an interface encompassing
blang.core.Model
and blang.core.Factor
.
For each item \( i \) in this collection \( c \),
If \( i \) is of type blang.core.Factor
, add \( i \)
to \( l \).
This case usually corresponds to the code generated by
a logf(...) { ... }
or indicator(...) { ... }
block,
in both cases the generated factor will be a
subtype of the subinterface blang.core.LogScaleFactor
.
It may also occur when a statement of the form variable is Constrained
is encountered,
in which case the instantiated object is blang.core.Constrained
.
If \( i \) is of type blang.core.Model
, the present procedure
is invoked to obtain a list \( l' \), the elements of which are all added
to the present list \( l \).
The next phase of initialization consists in building from the
instantiated Blang model an accessibility graph,
defined as follows. Vertices are taken as the transitive closure of objects (starting at the root model)
and of the constituents of these objects.
Constituents are fields in the case of objects and integer indices in the
case of arrays.
Exploration of a field can be skipped in the construction of the accessibility graph
using the annotation @SkipDependency
.
Constitutents can also be customized, as we describe later, for example to index entries of matrices.
The (directed) edges of the accessibility graph connect
objects to their constituents, and constituents to the object they resolve
to, if any. For example, a field might resolve to another object, creating an edge to that object,
or to null or a primitive in
which case no edge is created. We say that an object \( o_2 \) is accessible
from \( o_1 \) if there is a directed path from \( o_1 \) to \( o_2 \).
The latent variables are extracted from the vertex set of the accessibility graph as the intersection of:
objects of a type annotated with @Samplers
(more precisely, object where the class
itself, a superclass, or
an implemented interface has the annotation @Samplers
);
objects that are mutable, or have an accessible mutable children. Mutability is defined as follows.
By default,
mutability corresponds to accessibility of non-final fields (in Java, fields not marked with the
final
keyword, in Xtend, fields marked with var
) or arrays.
This behaviour can be overwritten with the annotation @Immutable
;
To get finer control on mutability, a designated Observations
object can be used
to mark individual nodes as observed. For example, this can be used to mark an individual entry
of a matrix as observed, using observationsObject.markAsObserved(matrix.getRealVar(i, j))
.
An instance of the Observations object is typically obtained via injection, as described in the
input output page. When a node is marked as observed, so is all
of its accessible children.
Given an accessibility graph, we are now able to efficiently answer the following questions: given a latent variable \( v \) and a factor \( f \), determine if the application of a sampling operator on \( v \) can change the numerical value of the factor \( f \).
To determine the answer to this question, we determine if \( v \) and \( f \) are co-accessible. Two objects \( o_1 \) and \( o_2 \) are co-accessible if there is a mutable object \( o_3 \) such that \( o_3 \) is accessible from both \( o_1 \) and \( o_2 \).
The total cost of the algorithm we use for finding, for each latent variable, all the co-accessible factors is proportional
to the sum over the factors \( f_i \) of the number of nodes accessible from \( f_i \).
The cost of this pre-processing is therefore expected to be negligible compared to the cost of sampling.
See blang.runtime.internals.objectgraph.GraphAnalysis
for details.
After identification of sparsity patterns, samplers are automatically matched to latent variables.
To demonstrate how it is done, let us look for example at the declaration
of the type Simplex
:
Here the annotation Samplers
provides a comma separated list of sampling algorithms. Here is the
specific sampler here:
The sampler should have exactly one field annotated with @SampledVariable
, it will be automatically
populated with the variable being sampled.
Then the sampler can declare one or several factors or lists of factors that are supported. These will also be
populated automatically by using the graph analysis machinery described above. For example, the list
numericFactors
in the above example will be populated with all the LogScaleFactor
's that
are co-accessible with the variable simplex
.
Instantiation of the sampler will only be considered if all the
co-accessible factors can be matched to some field with the @ConnectedFactor
annotation (either directly,
or into a list specifying the type of factor via its generic type parameter).
This mechanism can be used to disable default samplers. For example, going back to our simplex variable example,
naive methods such as
one-node Metropolis-Hastings should be avoided. To do so, we use a Constrained
variable (see for
example in Dirichlet.bl), so that the slice sampler is not instantiated (as the Slice sampler does not declare any annotated
field of type Constrained).
For many posterior inference methods, it is useful to have a sequence of measures parameterized by an annealing parameter \( t \in [0, 1] \) such that \( t = 0 \) corresponds to an easy to sample distribution (e.g. the prior), and \( t = 1 \) corresponds to the distribution of interest. For example, this is necessary to apply parallel tempering or sequential change of measure methods.
Having a sequence of distributions also helps alleviate the initialization problems found in methods purely based on MCMC, which require the sampler to start at a point of positive density. With Blang, we use the intermediate distributions to soften the support restrictions, returning when going out of support a small density going to zero as the annealing parameter approaches one. This does not affect the asymptotic guarantees of our inference method as the restrictions are fully enforced at \( t = 1 \).
One challenge when constructing the sequence of measures is that we have to guarantee that all intermediate measures have a finite normalization constant, otherwise the inference algorithms may silently fail, i.e. the Monte Carlo average may diverge or converge to incorrect values. We use the following strategy to avoid this problem. First, let us view the Blang model as a directed graphical model (we describe in more detail how this is done below). In this representation, the factors correspond to nodes in the graph. We split the factors into two groups according to whether the node is observed or not. Let us denote the product of the factors in each of the two groups by \( \ell(x) = \prod_{i\in I} \ell_i(x) \) and \( p(x) = \prod_{j\in J} p_j(x) \) respectively (here \( x \) encompasses all latent variables). The target posterior distribution satisfies \( \pi(x) \propto p(x)\ell(x) \). In simple cases, \( \ell(x) \) corresponds to the likelihood and \( p(x) \), to the prior; however we avoid this terminology since the formal definition given below generalizes to complex models where there is not necessarily a clear-cut prior and likelihood. We do have that \( \int p(x) \text{d} x = 1 \) in general though. Assuming the problem is well posed, the posterior is also normalizable, \( \int p(x) \ell(x) \text{d} x < \infty \), and moreover the same is true for subsets of observations, i.e. for \( K \subset I \), \( \int p(x) \prod_{i\in K} \ell_i(x) \text{d} x < \infty \).
Given that notation, the sequence of intermediate measures we use is given by \[\begin{align} \pi_t(x) = p(x) \prod_{i\in I}[(\ell_i(x))^t + I(\ell_i(x) = 0) \epsilon_t], \end{align}\]where \( \epsilon_t \le 1 \) is a decreasing sequence such \( \epsilon_1 = 0 \) and \( I(\cdot) \) denotes an indicator. Concretely, we use \( \epsilon_t = \exp(-t 1\text{e}100) I(t < 1) \).
Measures in this sequences are guaranteed to have finite normalization since: \[\begin{align} \int \pi_t(x) \text{d} x &= \int p(x) \prod_{i\in I}[ (\ell_i(x))^t + I(\ell_i(x) = 0) \epsilon_t ] \text{d} x\\ &\le \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} \int p(x) (\prod_{i \in K}\ell_i(x))^t \text{d} x \\ &= \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} \int p(x) (\prod_{i \in K}\ell_i(x))^t [I(\prod_{i \in K}\ell_i(x) \ge 1) + I(\prod_{i \in K}\ell_i(x) < 1)] \text{d} x \\ &\le \sum_{K: K \subset I} \epsilon_t^{|I|-|K|} [\int p(x) \prod_{i \in K}\ell_i(x) \text{d} x + \int p(x) \text{d} x] \\ &< \infty. \end{align}\]
The process by which factors are segregated between the "prior" set \( \{p_j : j\in J\} \) and the "likelihood" set \( \{\ell_i : i\in I\} \) proceeds recursively from the root model:
If the model has a generate
block:
If all the model's random
variables are observed (not latent),
put all the numerical factors defined by this model in the set \( \{\ell_i : i\in I\} \).
If all the model's random
variables are latent, put all the
numerical factors defined by this model in the set \( \{p_j : j\in J\} \).
If the current model's random
variables are partially observed,
throw an exception.
If the model does not have a generate
block
Recurse over all components of the model. Assume they are all themselves
models (i.e. if a model does not have generate
block, all
items in the laws
block should be based on the ... | ... ~ ...
construct rather than the logf
and indicator
ones.
We point out that this decomposition is not possible for all models. In such cases, it is often possible to rewrite the models to follow the rules outlined above. If not, the user can also back off to inference methods that do not require a sequence of measures. This can be done with the command line arguments (also setting the inference engine to Parallel Tempering with a single chain (PT):
Several inference algorithms are available built-in.
They can be configured by providing various arguments to the main application. To get a full list,
you can always use the switch --help
. Grep engine
to find the subset
related to inference algorithms.
The selection is controlled with the
switch --engine
:
SCM
: Sequential Change of Measure. This pushes a population of particles from
the distribution \( \pi_0 \) into \( \pi = \pi_1 \). In summary:
First, the forward simulation machinery is used to get
exact samples from \( \pi_0 \). Several particles are hence created.
The number of particles is controlled with --engine.nParticles
and controls
the quality of the approximation.
The annealing parameter (\( t \) indexing \( \pi_t \)) is then increased
sequentially from zero to one.
At each iteration, each particle is perturbed by a transition kernel invariant with respect
to the next annealed target and reweighted to take into account the annealing parameter increase.
The amount of increase at each
iteration is controlled by --engine.temperatureSchedule
, and the default value,
AdaptiveTemperatureSchedule
, is based on an adaptive strategy, see
Zhou, Johansen and Aston (2013).
When the relative Effective Sample Size (ESS) falls under a threshold
(specified by the command line argument --engine.resamplingESSThreshold
), resampling is performed.
The Annealed Importance Sampling algorithm
can be recovered as special case by
setting --engine.resamplingESSThreshold 0.0
.
After obtaining particles approximately distributed according to \( \pi_1 = \pi \) the quality of the
approximation can be optionally increased by performing rejuvenation steps on each particles, i.e. scans
where all transition kernels targeting \( \pi \) are applied in a random order. The number of
such scans is set by --engine.nFinalRejuvenations
. Set it to zero to disable rejuvenation
(for example if you only care about the estimate of the probability of the data).
The options --engine.nThreads
controls the number of parallel threads,
by default as many threads as half the number of idle cores are used (--engine.nThreads Dynamic
).
Note that the result of the computation is identical no matter how many
threads are used (this is achieved with the use of separate random stream indexed by particles).
PT
: Parallel Tempering. This creates parallel MCMC chains, each targeting different annealing
parameters.
The number of different temperatures can be controlled with --engine.nChains
(in particular,
setting it to 1 will reduce the algorithm to a standard MCMC algorithm.
If the option --engine.usePriorSamples
is set to true (the default), then the chain at parameter zero
is obtained by sampling iid from the prior distribution.
We define a scan as doing moves on all temperatures followed by attempting all either
odd or even swaps, followed by statistics collection.
Use --engine.nScans
to control the number of scans. Use
--engine.nPassesPerScan
to control how many passes to perform in
each chain on each scan, where we define a pass as applying each MCMC kernel once.
The option --engine.ladder
controls the scheme determining the
sequence of annealing parameters to use. Those available by default so far
are Geometric
and EquallySpaced
. Use for example
--engine.ladder Geometric --help
for more details on Geometric and
similarly for other schemes. To implement a custom scheme, create a class
implementing blang.engines.internals.ladders.TemperatureLadder
and use the fully qualified name of the form --engine.ladder mypackage.MyLadder
.
The options --engine.nThreads
controls the number of parallel threads,
by default as many threads as half the number of idle cores are used (--engine.nThreads Dynamic
).
Note that the result of the computation is identical no matter how many
threads are used (this is achieved with the use of separate random stream indexed by chains).
Forward
: Use the forward simulation machinery for the subset of the states that are not
observed.
Exact
: For models that are fully discrete, this enumerates all the configurations. The current
algorithm
does not attempt to exploit conditional independence assumptions and has exponential computational complexity.
It is still useful for debugging and pedagogy.
To use a custom inference scheme, implement the interface
PosteriorInferenceEngine
, and input the
fully qualified name of the implementation as argument.
The construction of the object graph (used for determining sparsity patterns) can be customized. The main use case for such customization is to construct views into larger objects, e.g. slices of matrices.
How the object graph is explored is controlled in blang.runtime.internals.objectgraph.ExplorationRules
.
To customize, one would change the list of rules in the static field defaultExplorationRules
via
static initialization.