# Blang

Tools for Bayesian data science and probabilistic exploration

Blang is a language and software development kit for doing Bayesian analysis. Our design philosophy is centered around the day-to-day requirements of real world data science. We have also used Blang as a teaching tool, both for basic probability concepts and more advanced Bayesian modelling. Here is the one minute tour:

package demo import conifer.* import static conifer.Utils.* model Example { random RealVar shape ?: latentReal, rate ?: latentReal random SequenceAlignment observations random UnrootedTree tree ?: unrootedTree(observations.observedTreeNodes) param EvolutionaryModel evoModel ?: kimura(observations.nSites) laws { shape ~ Exponential(1.0) rate ~ Exponential(1.0) tree | shape, rate ~ NonClockTreePrior(Gamma.distribution(shape, rate)) observations | tree, evoModel ~ UnrootedTreeLikelihood(tree, evoModel) } }

The above example illustrates several aspects of Blang:

• Blang goes beyond simple real and integer valued random variables. Here we have a type of phylogenetic tree, UnrootedTree. Bayesian inference over combinatorial spaces is important and currently neglected. So Blang uses an open type system and assists you in creating complex random types and correct sampling algorithms for these types.

• All the distributions shown, e.g. Exponential, NonClockTreePrior, etc, in particular, those in the SDK, are themselves written in Blang. This is important for extensibility and, crucially, for teaching. When using the Blang IDE, students can command-click on a distribution to jump to its definition in the language they are familiar with.

• The parameters of the distributions can themselves be distributions, e.g. NonClockTreePrior taking in a Gamma distribution as argument. This is useful to create rich probability models such as Bayesian non-parametric priors.

• As hinted by the keyword package, you can use other people's models, and package yours easily (and in a versioned fashion). This is useful to disseminate your work and critical to create reproducible analyses. Details such as dependency resolution are taken care of automatically.

If you have one more minute to spare, let us see what happen when we run this model (if you want to try at home, all you need to run this is Java 8 SDK and git installed):

> git clone https://github.com/UBC-Stat-ML/blangExample.git [cloning] > ./gradlew installDist [downloading dependencies and compiling] > ./build/install/example/bin/example \ --model.observations.file data/primates.fasta \ --model.observations.encoding DNA \ --engine SCM \ --engine.nThreads Max \ --excludeFromOutput observations Preprocessing started 4 samplers constructed with following prototypes: RealScalar sampled via: [RealSliceSampler] UnrootedTree sampled via: [SingleNNI, SingleBranchScaling] Sampling started [sampling progress report] Normalization constant estimate: -1216.1211229417504 Final rejuvenation started Preprocessing time: 141.4 ms Sampling time: 2.304 min executionMilliseconds : 138405 outputFolder : /Users/bouchard/blangExample/results/all/2017-12-15-14-00-21-3aKAx62h.exec

• We inferred a distribution over unobserved phylogenetic trees given an observed multiple sequence alignment.

• The engine used here, SCM (Sequential Change of Measure) is based on state-of-the-art sampling methods. It uses Jarzynski's method (with annealed distributions created automatically from the model) with a Sequential Monte Carlo algorithm and an adaptive temperature schedule. Other methods include Parallel Tempering, various non-reversible methods, and users can add other inference frameworks as well.

• The algorithm trivially parallelize to a large number of CPUs, here only 8 cores were used. Massive distribution over many nodes is in the pipeline.

• The method provides an estimate of the evidence, here -1216.12, which is critical for model selection and lacking in many existing tools. Running again the method with twice as many particles, we get -1215.95, suggesting the estimate is getting close to the true value. In contrast, variational methods will typically only give a bound on the true evidence.

• The resulting samples are easy to use: tidy csv's in a unique execution folder created for each run. You can integrate Blang in your data analysis pipeline seamlessly.

• The command line arguments are automatically inferred from the random variables declared in the model and the constructors of the corresponding types (with a bit of help of some annotations).

• Blang is built using Xtext, a powerful framework for designing programming languages. Thanks to this infrastructure, Blang incorporates a feature set comparable to many modern full fledge multi-paradigm language: functional, generic and object programming, static typing, just-in-time compilation, garbage collection, IDE support leverage static types and including debugging, etc.

• Blang runs on the JVM, so it is reasonably fast (typically within a small factor to any contender) without resorting to rewriting inner loops into low-level, error prone languages. You can also call any Java or Xtend code, and there is a good interoperability potential with the industrial data science stack such as Hadoop, Spark and DL4J.

• Blang is free and open source (permissive Berkeley License for both the SDK and the language infrastructure).