### Blang

All you need to get started is available in this zip file:

### Installing BlangIDE

Really, "installing" amounts to unzipping and copying the contents. The folder contains the IDE, a template for your own projects, and some command line tools.

The first time you try to launch BlangIDE, depending on the version of Mac OS X and/or your security settings, you may get a message saying the app "is not registered with Apple by an identified developer". To work around this, follow these instructions (from Apple) the first time you open BlangIDE (Mac OS will then remember your decision for subsequent launches):

1. In the Finder, locate BlangIDE (don't use Launchpad to do this).

2. Control-click the app icon, then choose Open from the shortcut menu.

If this does not work, an alternative is also described in the same Apple help page.

### Blang: a fifteen-minute tutorial

A Blang model specifies a joint probability distribution over a collection of random variables.

Here is an example, based on a very simple model for the famous Doomsday argument:

```
package demo

model Doomsday {
  random RealVar z
  random RealVar y
  param RealVar rate

  laws {
    z | rate ~ Exponential(rate)
    y | z ~ ContinuousUniform(0.0, z)
  }
}
```

Doomsday is just a name we give to this model. By convention, we encourage users to capitalize model names (Blang is case-sensitive).

Variables need to specify their type: for example, random RealVar z declares a variable named z of type RealVar. Some of the other important built-in types are IntVar and DenseMatrix.

random and param are Blang keywords. We will get back to the difference between the two.

By convention, types are capitalized and variable names are not.

The section laws { ... } defines the distributions and conditional distributions of the random variables. The syntax mirrors standard probability notation. For example, y | z ~ ContinuousUniform(0.0, z) means that the conditional distribution of y given z is uniform between zero and z.
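In standard notation, these two laws combine into the following joint density (a sketch, assuming the usual rate parameterization of the exponential):

$$
p(z, y \mid \text{rate}) \;=\; \underbrace{\text{rate}\, e^{-\text{rate}\, z}}_{z \,\mid\, \text{rate}} \;\cdot\; \underbrace{\tfrac{1}{z}\, \mathbf{1}\{0 \le y \le z\}}_{y \,\mid\, z}
$$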

Each Blang model is turned into a program supporting various inference methods. To demonstrate that, let's run the above example.

1. Set up one of these two methods: running Blang with the Web App, or with the BlangIDE.

2. Once you follow the above steps, you will get a message about missing arguments. These arguments control the data the model should condition on, as well as the algorithm used to approximate the conditional expectations (the 'inference engine'). The arguments are discovered automatically, with minimal help from some annotations; we will cover that later. For now, let's provide the minimal set:

```
--model.rate 1.0 \
--model.y 1.2 \
--model.z NA
```

This specifies values for rate and y, and marks z as missing (unobserved, and hence sampled over). You will see output similar to the following:

```
Preprocessing started
1 samplers constructed with following prototypes:
  RealScalar sampled via: [RealSliceSampler]
Sampling started
Normalization constant estimate: -1.8657991502743467
Preprocessing time: 77.99 ms
Sampling time: 2.511 s
executionMilliseconds : 2593
outputFolder : /Users/bouchard/w/blangSDK/results/all/2017-12-15-14-45-13-qFhVg0M8.exec
```

The most important piece of information here is the outputFolder. Look into that directory: you will find the samples in samples/z.csv, in a tidy format ready to be consumed by any sane data analytic tool.
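As a quick illustration, here is a minimal sketch of consuming such a tidy CSV in plain Python. The column name "value" is an assumption; inspect the header of samples/z.csv for the actual column names.

```python
import csv
from statistics import mean, stdev

def posterior_summary(path, column="value"):
    """Return the posterior mean and sample standard deviation of one
    column of a tidy samples CSV (one row per MCMC sample).
    NOTE: the column name "value" is an assumption; check the header."""
    with open(path) as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    return mean(values), stdev(values)
```

Any data-frame tool (pandas, R's readr, etc.) would of course work just as well on this format.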

You can also view the list of all arguments by adding the argument --help.

Let's look at how ContinuousUniform is implemented in the SDK. Since the SDK is itself written in Blang, you will proceed in exactly the same way to create your own distributions. Control-click on ContinuousUniform in BlangIDE and you will be taken to its definition:

```
package blang.distributions

/** Uniform random variable over a closed interval $$[m, M]$$. */
model ContinuousUniform {
  random RealVar realization

  /** The left end point $$m$$ of the interval. $$m \in (-\infty, M)$$ */
  param RealVar min

  /** The right end point of the interval. $$M \in (m, \infty)$$ */
  param RealVar max

  laws {
    logf(min, max) {
      if (max - min <= 0.0) return NEGATIVE_INFINITY
      return - log(max - min)
    }
    indicator(realization, min, max) {
      min <= realization && realization <= max
    }
  }

  generate(rand) {
    rand.uniform(min, max)
  }
}
```

The syntax should be self-explanatory:

• the laws block defines the log density as the sum of the logf factors listed (indicator is just a shortcut for 0-1 factors),

• the optional generate block specifies a forward sampling procedure.

The bodies of logf, indicator, and generate admit a rich, concise, Turing-complete syntax. We will refer to such blocks as XExpressions and will talk more about them later on.
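For ContinuousUniform, for instance, the logf and indicator factors listed above combine into the familiar log density:

$$
\log f(x) \;=\; -\log(M - m) \,+\, \log \mathbf{1}\{m \le x \le M\} \;=\;
\begin{cases}
-\log(M - m) & \text{if } m \le x \le M, \\
-\infty & \text{otherwise.}
\end{cases}
$$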

Another important way of creating models is by composing and transforming one or several other distributions. Look at the definition of Exponential, for example:

```
package blang.distributions

/** Exponential random variable. Values in $$(0, \infty)$$ */
model Exponential {
  random RealVar realization

  /** The rate $$\lambda$$, inversely proportional to the mean. $$\lambda > 0$$ */
  param RealVar rate

  laws {
    realization | rate ~ Gamma(1.0, rate)
  }

  generate (rand) {
    rand.exponential(rate)
  }
}
```

For both models constructed using an explicit density (like ContinuousUniform), and those constructed by composition (like Exponential), we invoke them in the same way:

```
randomVariable1, ... | conditioning ~ NameOfModel(parameter1, ...)
```

where the random variables are listed in the same order as the variables marked by the keyword random appear in the invoked model definition, and the parameters are listed in the same order as the variables marked by the keyword param.

To create your own distribution, simply create a new .bl file in your project. When you want to use it in another file, don't forget to add an import declaration after the package declaration (only certain packages are automatically imported, such as blang.distributions).

A plate is simply an element of a graphical model which is repeated many times. As an example, let's look at a simple hierarchical modelling problem: suppose you have a data file failure_counts.csv of this form:

```
"","LV.Type","numberOfLaunches","numberOfFailures"
"1","Aerobee",1,0
"2","Angara A5",1,0
"3","Antares 110",2,0
"4","Antares 120",2,0
"5","Antares 130",1,1
"6","Antares 230",1,0
"7","Ariane 1",11,2
"8","Ariane 2",6,1
"9","Ariane 3",11,1
"10","Ariane 40",7,0
```

Each row contains a Launch Vehicle (LV) type, the number of failed launches for that type of rocket, and the total number of launches. We would like to get a posterior distribution over the failure probability of each LV type via a hierarchical model that borrows strength across types. Here is a Blang model that does that:

```
package blang.validation.internals.fixtures

model HierarchicalModel {
  param GlobalDataSource data
  param Plate<String> rocketTypes
  param Plated<IntVar> numberOfLaunches
  random Plated<RealVar> failureProbabilities
  random Plated<IntVar> numberOfFailures
  random RealVar a ?: latentReal, b ?: latentReal

  laws {
    a ~ Exponential(1)
    b ~ Exponential(1)
    for (Index<String> rocketType : rocketTypes.indices) {
      failureProbabilities.get(rocketType) | a, b ~ Beta(a, b)
      numberOfFailures.get(rocketType)
        | RealVar failureProbability = failureProbabilities.get(rocketType),
          IntVar numberOfLaunch = numberOfLaunches.get(rocketType)
        ~ Binomial(numberOfLaunch, failureProbability)
    }
  }
}
```
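Writing $n_i$ for the number of launches and $k_i$ for the number of failures of rocket type $i$, this model reads, in standard notation:

$$
a, b \sim \text{Exponential}(1), \qquad
p_i \mid a, b \sim \text{Beta}(a, b), \qquad
k_i \mid p_i, n_i \sim \text{Binomial}(n_i, p_i).
$$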

The for loop here uses plates and plated objects to set up a large graphical model. More generally, the syntax is for (IteratorType iteratorName : collection) { ... }, where collection is any instance of the Iterable interface.

To run the HierarchicalModel example, use the following options:

```
--model.data failure_counts.csv \
--model.rocketTypes.name LV.Type
```

The first option corresponds to the line param GlobalDataSource data in the Blang model. It provides a default csv file in which to look for data for all the Plate and Plated variables (a "Plated" type is just a variable that sits within a plate, i.e. that is repeated).

By default, each Plate and Plated variable will look for a column whose name matches the one given in the Blang file. We only need to override this default for the rocketTypes plate, by setting the command line argument --model.rocketTypes.name LV.Type.

Arbitrary Java or Xtend types are interoperable with Blang. When you want to use them as latent variables, some additional work is needed. However, Blang provides utilities to assist you in this process, in particular for testing correctness.

As a first example, let's look at how sampling is implemented for Simplex variables in the SDK (i.e. vectors whose entries are constrained to sum to one). Sampling such a variable requires special attention because of the sum-to-one constraint.
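Formally, a simplex vector with $n$ entries is a point of

$$
\Delta^{n-1} = \Big\{ x \in \mathbb{R}^n : x_i \ge 0, \; \sum_{i=1}^n x_i = 1 \Big\},
$$

so naively resampling a single entry would leave the constraint violated.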

After implementing the class DenseSimplex (just a plain Java class, based on an n-by-1 matrix), we add an annotation to point to the sampler that we will design: @Samplers(SimplexSampler).

Here is the sampler:

```
package blang.mcmc

import bayonet.distributions.Random
import blang.core.Constrained
import blang.types.DenseSimplex
import blang.core.LogScaleFactor
import java.util.List
import blang.mcmc.internals.SamplerBuilderContext
import blang.mcmc.internals.SimplexWritableVariable

class SimplexSampler implements Sampler {

  @SampledVariable DenseSimplex simplex

  @ConnectedFactor List<LogScaleFactor> numericFactors

  @ConnectedFactor Constrained constrained

  override void execute(Random rand) {
    val int sampledDim = rand.nextInt(simplex.nEntries)
    val SimplexWritableVariable sampled = new SimplexWritableVariable(sampledDim, simplex)
    val RealSliceSampler slicer = RealSliceSampler::build(sampled, numericFactors, 0.0, sampled.sum)
    slicer.execute(rand)
  }

  override boolean setup(SamplerBuilderContext context) {
    return simplex.nEntries >= 2 &&
      constrained !== null &&
      constrained.object instanceof DenseSimplex
  }
}
```

• The actual work is done in the execute method. SimplexWritableVariable is just a utility that, when the entry at index sampledDim in the simplex is altered, decreases the following entry (modulo the number of entries) by the same amount, preserving the sum-to-one constraint. After picking an index, we use a slice sampler to perform the actual sampling.

• The instantiation of samplers is automated. The instance variables annotated with @SampledVariable and @ConnectedFactor guide this process.

• @SampledVariable is filled with the variable to be sampled.

• Then the factors connected to this variable all need to be assigned to @ConnectedFactor instance variables for the sampler to be included in the sampling process.

• LogScaleFactor is the interface for the factors created by logf and indicator blocks.

• Constrained is a factor used to mark variables that require special samplers. For example, the Dirichlet distribution contains the line realization is Constrained to ensure that standard samplers for real variables are avoided in the context of a simplex.

• The optional method setup performs additional initialization and checks if needed. It should return a boolean indicating whether this sampler should be used in the current context.
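The compensation mechanic described in the first bullet can be sketched in plain Python (a hypothetical stand-in for illustration, not the SDK's actual implementation):

```python
def set_simplex_entry(x, i, new_value):
    """Set x[i] to new_value and subtract the change from the next entry
    (wrapping around), so that sum(x) is unchanged.
    Sketch of the idea behind SimplexWritableVariable; the real class also
    interacts with the slice sampler's bounds, which we ignore here."""
    delta = new_value - x[i]
    x[i] = new_value
    x[(i + 1) % len(x)] -= delta
    return x
```

A slice sampler can then move one entry at a time while the vector stays normalized; validity also requires the compensated entry to remain nonnegative, which is what the bounds passed to RealSliceSampler::build enforce.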

### More pointers

Additional tutorial and reference materials to go more in-depth: