Exercise 1: Forward simulation

03 Jan 2014

Forward simulation from stochastic processes is generally simple and can often be done exactly. In statistical settings, a more complex task is generally required, posterior simulation, but forward simulation is still very useful for a few reasons. First, it is a great pedagogical tool for understand stochastic processes. Second, it is a useful debugging tool when implementing probabilistic inference software (more on this later).

This first exercise will be devoted to forward simulation for a range of stochastic processes. The majority of the remaining exercises will cover posterior simulation (or sampling), focussing on a more restricted set of processes at a time.

Technical note: while during development it is a good idea to fix random seeds, please create your final package so that the seeds are not fixed (this will allow us to control the sample size by calling the program several times if needed).

Question 1: Continuous-time Markov chains (CTMC)

Implement the forward simulation algorithm covered in class. We will use it to simulate possible future states of a DNA sequence, in other words, to simulate one unit of time independently for each nucleotide.

Your program should read two input files: one containing a DNA sequence, for example:

ACCCTAG

and a second file containing a rate matrix (part of the exercise includes reading what a rate matrix is, and transforming it into an input on which your algorithm can operate on). An example of the rate matrix format is:

-0.3 0.1 0.1 0.1
0.2 -0.6 0.2 0.2
0.1 0.1 -0.3 0.1
0.1 0.1 0.1 -0.3

Note: the rows and columns enumerate the nucleotides in the following order: A, C, G, T.

You will hand-in a zip file called xxxxxx-ex1-q1.zip, where xxxxxx should be replaced by your full student number. This zip file should contain an executable file named run which takes the path of DNA file as a first argument, and the path of the matrix as a second argument. It should output a new simulated string, and only that. For example, typing:

unzip xxxxxx-ex1-q1.zip
cd xxxxxx-ex1-q1
./run dnaFile.txt matrixFile.txt

with the two files in the current directory should print out something like:

GCTCTAG

Question 2: Dirichlet processes

Write two programs that sample from $\underline{\theta_1}, \dots, \underline{\theta_{100}}$ using two different method:

  1. The CRP method (in xxxxxx-ex1-q2-1.zip).
  2. The stick breaking method (in xxxxxx-ex1-q2-2.zip).

Use $\alpha_0 = 1$ and as a base measure, a product of a standard normal and a gamma with shape and rate parameters equal to one.

The output of your algorithm should look like:

-0.2 0.14
0.12 2.7
-1.1 2.4

and so on, for 100 lines.

Question 3: CRPs

Prove that the CRP is exchangeable. Post your response as a pdf named xxxxx-ex1-q3.pdf, where xxxxx is again your student number.