This page covers:

- Input: how to load data. This is used for:
  - fixing a random variable's value to a given observation (conditioning),
  - setting the hyper-parameters of models,
  - setting the tuning parameters of inference algorithms.
- Output: how to control the output of samples when custom types are used.
Inputs are controlled using the injection framework inits, which is designed for dependency injection in the context of scientific models. This is entirely automatic for existing Blang types; read this section only if you want to create custom data types and condition on them, i.e. load data and fix the variable to that loaded value.
To summarize, instantiation of arbitrary types is approached recursively with the following main schemes (a sketch illustrating them follows the list):
When instantiating a class: a constructor or static factory is selected by looking for the annotation @ProvidesFactory in the file declaring the custom type. See also the class Parsers, which contains examples for basic types, e.g. those from the JVM or from xlinear. As a fall-back, a no-argument constructor, if available, will be attempted.
Each argument of this constructor or static factory should be annotated as follows:
- For arguments to be read from the command line, use @ConstructorArg(value = "nameOfArg"). The type of each argument will be recursively inspected to figure out how to parse it. To bootstrap the process, you can also declare an argument @Input String string or @Input List<String> strings and parse the provided string or strings manually.
- To mark certain entries as observed, you can make the random variable immutable. Alternatively, you can mark subgraphs of the accessibility graph as observed by declaring a constructor argument @GlobalArg Observations initContext and then calling initContext.markAsObserved(object).
- To recursively parse other strings into arbitrary types, declare a constructor argument @InitService Creator creator and call creator.init(type, arguments), where type can be a class literal (such as String or Integer) or an instance of TypeLiteral. The arguments can be obtained via SimpleParser.parse(string) in most cases.
- As a short-hand, it is also possible to annotate fields with @Arg; these will be populated automatically after the constructor or static factory is called. For both @Arg and @ConstructorArg, you can give the argument a default value via @DefaultValue, or make it optional by enclosing the declared type in an Optional<..>.
When instantiating an interface, the following is also available: add the annotation @Implementations to the interface, with a comma-separated list of implementations, then follow the above process for each implementation.
Enumerations (enum) are taken care of automatically.
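To make this concrete, here is a minimal sketch in Java of a custom type wired for inits. The annotations are the ones described above; the type MyMatrix, its fields, the Matrix interface, and the implementation classes are hypothetical, and the package location of the annotations is an assumption:

```java
import java.util.Optional;
import blang.inits.*; // assumed package for Arg, ConstructorArg, DefaultValue, etc.

// Hypothetical custom type illustrating the annotations above.
public class MyMatrix {

  @Arg @DefaultValue("1.0")      // field populated after the factory is called
  public double scale;

  @Arg
  public Optional<String> label; // optional: may be absent from the command line

  public final int nRows, nCols;

  private MyMatrix(int nRows, int nCols) {
    this.nRows = nRows;
    this.nCols = nCols;
  }

  // Selected by inits thanks to the @ProvidesFactory annotation; each
  // argument becomes a (hierarchically named) command line switch.
  @ProvidesFactory
  public static MyMatrix build(
      @ConstructorArg(value = "nRows") Integer nRows,
      @ConstructorArg(value = "nCols") Integer nCols) {
    return new MyMatrix(nRows, nCols);
  }
}

// For an interface, list the implementations (shown schematically; see
// the inits README for the exact annotation syntax):
@Implementations({DenseMatrixImpl.class, SparseMatrixImpl.class})
interface Matrix {
  double get(int row, int col);
}
```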
For more information, see the README.md file in the inits repository.
As a convention, we use the string NA to mean unobserved (latent). This string can be accessed in a type-safe manner via NA::SYMBOL.
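For example, a custom parser can combine this convention with the Observations mechanism described above. A minimal sketch in Java, where MyReal and its set method are hypothetical and the import locations are assumptions:

```java
import blang.inits.*;              // assumed package: Input, GlobalArg, ProvidesFactory
import blang.io.NA;                // assumed location of the NA convention class
import blang.runtime.Observations; // assumed location of Observations

// Hypothetical mutable real variable; "set" is illustrative only.
public class MyReal {
  private double value = 0.0;
  public void set(double value) { this.value = value; }

  // The string "NA" leaves the variable latent; anything else is
  // parsed, and the resulting entry is marked as observed.
  @ProvidesFactory
  public static MyReal parse(
      @Input String string,
      @GlobalArg Observations initContext) {
    MyReal result = new MyReal();
    if (!string.equals(NA.SYMBOL)) { // NA::SYMBOL holds the string "NA"
      result.set(Double.parseDouble(string));
      initContext.markAsObserved(result);
    }
    return result;
  }
}
```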
Argument parsing is taken care of automatically (by introspection of the injection framework's annotations), and switches are named hierarchically.
Here is a concrete example to show how it works. Blang's main class declares an annotated field @Arg PosteriorInferenceEngine engine. This type declares a list of implementations.
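Schematically, the declaration looks like the following sketch (the exact set of implementations is an assumption and may differ across Blang versions):

```java
@Implementations({SCM.class, PT.class, Forward.class, Exact.class, None.class})
public interface PosteriorInferenceEngine {
  // ... engine contract omitted
}
```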
Now let's look at one of those implementations, say SCM. SCM's parent class is AdaptiveJarzynski, which declares @Arg Cores nThreads. In turn, Cores declares a static factory.
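A sketch of such a factory in Java (the actual parsing logic in Blang may differ; the "MAX" keyword is an assumption for illustration):

```java
import blang.inits.*; // assumed package for Input and ProvidesFactory

public class Cores {

  public final int number;

  public Cores(int number) { this.number = number; }

  // Parses either an explicit integer, or a keyword standing
  // for all available cores.
  @ProvidesFactory
  public static Cores parse(@Input String description) {
    if (description.trim().equals("MAX"))
      return new Cores(Runtime.getRuntime().availableProcessors());
    return new Cores(Integer.parseInt(description));
  }
}
```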
This creates the following command line options (described here by a snippet of what is produced by --help):
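An illustrative snippet (the option names follow the hierarchical naming scheme above; the descriptions are paraphrased, not verbatim --help output):

```
--engine <PosteriorInferenceEngine: SCM|PT|Forward|Exact|None|fully qualified name>
--engine.nThreads <Cores: an integer, or a keyword selecting all available cores>
```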
Every Blang execution creates a unique directory. Its path is printed to standard out at the end of the run, and the latest run is also symlinked at results/latest.
The directory has the following structure:
- samples/: samples from the target distribution. By default, each random variable in the running model is output at each iteration (to disable this for some variables, e.g. those that are fully observed, use --excludeFromOutput). We describe the format in more detail below.
- logNormalizationEstimate.csv: estimate of the natural logarithm of the probability of the data (also known as the log of the normalization constant of the prior times the likelihood, integrating over the latent variables). Only available for certain inference engines such as SCM.
- arguments*: arguments used in this run.
- executionInfo/: additional information for reproducibility (JVM arguments, standard out, etc.). To automatically record the code version, use --experimentConfigs.recordGitInfo true.
- monitoring/: diagnostics for the samplers.
The samples are stored in tidy CSV files. For example, two samples of a list of two RealVar's might look like this:
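An illustrative snippet (the column layout, with one index column per dimension plus sample and value columns, reflects the tidy format; the values shown are made up):

```
index_0,sample,value
0,0,0.53
1,0,1.10
0,1,0.43
1,1,0.87
```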
By default, the method toString is used to create the last column (value). This behaviour can be customized to follow the tidy philosophy: to do so, implement the interface TidilySerializable (see the Blang documentation for an example).
The following command line arguments can be used to tune the output:

- --excludeFromOutput: space-separated list of random variables to exclude from the output.
- --experimentConfigs.managedExecutionFolder: set to false to write output in the current folder instead of the unique folder created in results/all.
- --experimentConfigs.recordExecutionInfo: set to false to skip recording reproducibility information in executionInfo.
- --experimentConfigs.recordGitInfo: set to true to record git repository information for the code.
- --experimentConfigs.saveStandardStreams: set to false to skip recording the standard out and err streams.
- --experimentConfigs.tabularWriter: CSV by default. Can be set to Spark to organize the tidy output into a hierarchy of directories, each containing a CSV file with fewer columns (many columns become inferable from the names of the parent directories). In certain scenarios this can save disk space, and the result is inter-operable with Spark; a hypothetical illustration follows.
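For intuition, the Spark writer's layout for the list example above might look as follows (hypothetical: the exact directory and file naming produced by Blang may differ; the key=value directory convention shown is the one Spark reads natively):

```
samples/
  variableName/
    index_0=0/
      data.csv   # columns: sample,value
    index_0=1/
      data.csv
```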