STAT 520 - Bayesian Analysis

Alexandre Bouchard-Côté

3/13/2019

Goals

Today:

Logistics

Updated schedule with some readings

Recall: Exercise 2 due on Monday

Recap: graphical model and JAGS implementation

drawing

Approaches:

model {

  slope ~ dnorm(0, 0.001)
  intercept ~ dnorm(0, 0.001)

  for (i in 1:length(temperature)) {
    # incident[i] ~ dbern(1.0 / (1.0 + exp(- intercept - slope * temperature[i])))
    p[i] <- ilogit(intercept + slope * temperature[i])
    incident[i] ~ dbern(p[i])
  }

  # predictedPr <- 1.0 / (1.0 + exp(- intercept - slope * 31))
  predictedPr <- ilogit(intercept + slope * 31)
  predictedBinary ~ dbern(predictedPr)
}
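In equation form, the model above is the following (keeping in mind that JAGS parameterizes dnorm by precision, so dnorm(0, 0.001) means variance \(1000\)); the prediction node simply plugs the temperature of interest, 31, into the same formula:

\[\begin{aligned}
\text{slope}, \text{intercept} &\sim {\mathcal{N}}(0, \sigma^2 = 1000), \\
p_i &= \text{ilogit}(\text{intercept} + \text{slope} \cdot \text{temperature}_i) = \frac{1}{1 + \exp(-\text{intercept} - \text{slope} \cdot \text{temperature}_i)}, \\
\text{incident}_i &\sim {\text{Bern}}(p_i).
\end{aligned}\]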
require("rjags")
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
require("coda")
require("ggplot2")

data <- read.csv( file = "challenger_data-train.csv", header = TRUE)

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  'model.bugs', 
  data = list( # Pass on the data:
    'incident' = data$Damage.Incident, 
    'temperature' = data$Temperature), 
  inits=inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 23
##    Unobserved stochastic nodes: 4
##    Total graph size: 111
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('slope', 'intercept', 'predictedPr', 'predictedBinary'), # These are the variables we want to monitor (plot, etc)
               100000) # number of MCMC iterations
summary(samples)
## 
## Iterations = 1001:101000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 1e+05 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                    Mean      SD  Naive SE Time-series SE
## intercept       19.8246 9.25184 0.0292569       0.803987
## predictedBinary  0.9930 0.08361 0.0002644       0.001392
## predictedPr      0.9925 0.04013 0.0001269       0.001341
## slope           -0.3032 0.13592 0.0004298       0.011746
## 
## 2. Quantiles for each variable:
## 
##                    2.5%     25%     50%     75%    97.5%
## intercept        5.6143 13.1947 18.4525 25.1527 41.85570
## predictedBinary  1.0000  1.0000  1.0000  1.0000  1.00000
## predictedPr      0.9329  0.9989  0.9999  1.0000  1.00000
## slope           -0.6264 -0.3816 -0.2830 -0.2057 -0.09434
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                      lower       upper
## intercept        4.0054009 38.42484799
## predictedBinary  1.0000000  1.00000000
## predictedPr      0.9723887  1.00000000
## slope           -0.5749028 -0.06893211
## attr(,"Probability")
## [1] 0.95
# "Jailbreaking" from coda into a saner data frame
chain <- samples[[1]]
intercept <- chain[,1]
slope <- chain[,4]
df <- data.frame("intercept" = as.numeric(intercept), "slope" = as.numeric(slope))
summary(df)
##    intercept          slope         
##  Min.   :-2.853   Min.   :-0.98311  
##  1st Qu.:13.195   1st Qu.:-0.38156  
##  Median :18.452   Median :-0.28303  
##  Mean   :19.825   Mean   :-0.30323  
##  3rd Qu.:25.153   3rd Qu.:-0.20573  
##  Max.   :66.041   Max.   : 0.02563
ggplot(df, aes(intercept, slope)) +
  geom_density_2d()

Some things to pay particular attention to:

Recap: A/B testing

In its early days, Google used a method in the spirit of what is described below to tweak their website. Let us say that they want to decide if they should use a white background or a black background (or some other cosmetic feature).

After a short interval of time, they record the following results: version A was shown \(n_A = 51\) times and led to \(s_A = 45\) successes, while version B was shown \(n_B = 47\) times and led to \(s_B = 37\) successes (these are the counts passed to JAGS below).

Think about the following questions:

  1. What would be a model suitable for a Bayesian analysis of this problem?
  2. How would you make a decision (A or B) based on the MCMC output?
  3. How would you implement the idea using JAGS?

Recap: A/B testing JAGS implementation

require("rjags")
require("coda")
require("ggplot2")

modelstring="
  model {
  
    pA ~ dunif(0, 1)
    pB ~ dunif(0, 1)
  
    sA ~ dbin(pA, nA)
    sB ~ dbin(pB, nB)
    
    AisBetterThanB <- ifelse(pA > pB, 1, 0)
  }
"

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
my.seed <- 1 
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  textConnection(modelstring), 
  data = list( # Pass on the data:
    'nA' = 51,
    'sA' = 45, 
    'nB' = 47,
    'sB' = 37), 
  inits=inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2
##    Unobserved stochastic nodes: 2
##    Total graph size: 16
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('pA', 'pB', 'AisBetterThanB'), # These are the variables we want to monitor (plot, etc)
               10000) # number of MCMC iterations
summary(samples)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                  Mean      SD  Naive SE Time-series SE
## AisBetterThanB 0.8904 0.31241 0.0031241      0.0032970
## pA             0.8673 0.04641 0.0004641      0.0004566
## pB             0.7762 0.05888 0.0005888      0.0005689
## 
## 2. Quantiles for each variable:
## 
##                  2.5%    25%    50%    75%  97.5%
## AisBetterThanB 0.0000 1.0000 1.0000 1.0000 1.0000
## pA             0.7620 0.8394 0.8720 0.9007 0.9439
## pB             0.6529 0.7388 0.7799 0.8179 0.8803
plot(samples)

print(HPDinterval(samples))
## [[1]]
##                    lower     upper
## AisBetterThanB 0.0000000 1.0000000
## pA             0.7743609 0.9493196
## pB             0.6578581 0.8831351
## attr(,"Probability")
## [1] 0.95
chain <- samples[[1]]
pA <- chain[,2]
pB <- chain[,3]
df <- data.frame("pA" = as.numeric(pA), "pB" = as.numeric(pB))
summary(df)
##        pA               pB        
##  Min.   :0.5934   Min.   :0.5235  
##  1st Qu.:0.8394   1st Qu.:0.7388  
##  Median :0.8720   Median :0.7799  
##  Mean   :0.8673   Mean   :0.7762  
##  3rd Qu.:0.9007   3rd Qu.:0.8179  
##  Max.   :0.9788   Max.   :0.9465
ggplot(df, aes(pA, pB)) +
  xlim(0, 1) +
  ylim(0, 1) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point(size = 0.3, alpha = 0.3)

ggplot(df, aes(pA, pB)) +
  xlim(0, 1) +
  ylim(0, 1) +
  geom_abline(intercept = 0, slope = 1) +
  geom_density_2d()

\[\frac{\text{# points in lower triangle}}{M} \approx {\mathbb{P}}(p_A > p_B | \text{data}) = \int \int_{x > y} f_{p_A, p_B|\text{data}}(x, y) {\text{d}}x {\text{d}}y\]
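In code, the Monte Carlo estimate on the left-hand side is just the fraction of posterior samples in which pA exceeds pB, computed from the data frame built above:

# Monte Carlo estimate of P(pA > pB | data)
mean(df$pA > df$pB)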

Example adapted from: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb

Hierarchical model

Motivation:

single_rocket_spec <- 
"model {
  # The likelihood
  X ~ dbin(p,n)
  
  # Our prior
  p ~ dbeta(a,b)
}"

Key idea: use “side data” to inform the prior

data <- read.csv("failure_counts.csv")
require("dplyr") # provides the %>% pipe used below
data %>% 
  head() %>% 
  knitr::kable(floating.environment="sidewaystable")
LV.Type       numberOfLaunches   numberOfFailures
Aerobee                      1                  0
Angara A5                    1                  0
Antares 110                  2                  0
Antares 120                  2                  0
Antares 130                  1                  1
Antares 230                  1                  0

How to (badly) use side data

First try

LV.Type       numberOfLaunches   numberOfFailures
Aerobee                      1                  0
Angara A5                    1                  0
Antares 110                  2                  0
Antares 120                  2                  0
Antares 130                  1                  1
Antares 230                  1                  0

Why it is bad

Towards an improved way to use side data

An improved way to use side data (not quite full Bayesian yet!)

counts <- read.csv("failure_counts.csv")
ggplot(counts, aes(x = numberOfFailures / numberOfLaunches)) +
  geom_histogram()
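One way to make "use the histogram as a prior" concrete is a plug-in, empirical-Bayes-style shortcut: fit a Beta(a, b) to the observed failure rates and pass the fitted values as the a and b of single_rocket_spec. The moment-matching recipe and the names a.hat, b.hat below are my own sketch, not necessarily the recipe used in the lecture:

rates <- counts$numberOfFailures / counts$numberOfLaunches
m <- mean(rates)
v <- var(rates)
# Beta(a, b): mean = a / (a + b), variance = mean * (1 - mean) / (a + b + 1)
ab.sum <- m * (1 - m) / v - 1
a.hat <- m * ab.sum
b.hat <- (1 - m) * ab.sum
c(a.hat, b.hat)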

Solution: go fully Bayesian

Recall:

  1. Construct a probability model including
    • random variables for what we will measure/observe
    • random variables for the unknown quantities
      • those we are interested in (“parameters”, “predictions”)
      • others that just help us formulate the problem (“nuisance”, “random effects”).
  2. Compute the posterior distribution conditionally on the actual data at hand
  3. Use the posterior distribution to:
    • make predictions (point estimates)
    • estimate uncertainty (credible intervals)
    • make a decision

drawing

In our case: just make \(\mu\) and \(s\) random! (or equivalently, \(\alpha\) and \(\beta\))

New higher-level hyperparameters = new problems? No, we are probably OK!

It may seem that we have just introduced new problems, since we again have hyperparameters: those of the priors on \(\mu\) and \(s\). Here we picked \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\) and \(s \sim \text{Exponential}(1/10000)\).
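Concretely, the hierarchical version of single_rocket_spec could look like the sketch below. The name hierarchical_rocket_spec and the reparameterization \(\alpha = \mu s\), \(\beta = (1 - \mu) s\) are my additions; the column names match failure_counts.csv:

hierarchical_rocket_spec <- 
"model {
  # Hyperpriors, as above (JAGS's dexp is parameterized by its rate)
  mu ~ dbeta(1, 1)      # equivalently Unif(0, 1)
  s ~ dexp(1/10000)

  # Equivalent (alpha, beta) parameterization of the shared Beta prior
  alpha <- mu * s
  beta <- (1 - mu) * s

  # One failure probability per launch vehicle type, tied together by the shared prior
  for (i in 1:length(numberOfLaunches)) {
    p[i] ~ dbeta(alpha, beta)
    numberOfFailures[i] ~ dbin(p[i], numberOfLaunches[i])
  }
}"

The data list passed to jags.model would then contain the numberOfLaunches and numberOfFailures columns of failure_counts.csv.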

Key point: yes, but now we are less sensitive to these choices!

Why? Heuristic: say you have a random variable connected to some hyperparameters (grey squares) and random variables connected to data (circles).

Before going hierarchical: for maiden/early flights we had

drawing

After going hierarchical:

drawing

Using more information

full <- read.csv("processed.csv")
[1] "    X  X..Launch    Launch.Date..UTC.      COSPAR         PL.Name                        Orig.PL.Name                SATCAT   LV.Type                 LV.S.N            Site                              Suc   Ref                     Suc_bin  Family           Space.Port    Year   Launch.Index"
[2] "-----  -----------  ---------------------  -------------  -----------------------------  --------------------------  -------  ----------------------  ----------------  --------------------------------  ----  ---------------------  --------  ---------------  -----------  -----  -------------"
[3] "    1  1957 ALP     1957 Oct  4 1928:34    1957 ALP 2     1-y ISZ                        PS-1                        S00002   Sputnik 8K71PS          M1-PS             NIIP-5   LC1                      S     Energiya                      1  Sputnik          NIIP          1957              1"
[4] "    2  1957-U01     1957 Oct 17 0505       1957-U01       USAF 88 Charge A               Poulter Pellet              A08258   Aerobee                 USAF 88           HADC     A                        S     EngSci1.58                    1  Aerobee          HADC          1957              1"
[5] "    3  1957 BET     1957 Nov  3 0230:42    1957 BET 1     2-y ISZ                        PS-2                        S00003   Sputnik 8K71PS          M1-2PS            NIIP-5   LC1                      S     Grahn-WWW                     1  Sputnik          NIIP          1957              2"
[6] "    4  1957-F01     1957 Dec  6 1644:35    1957-F01       Vanguard                       Vanguard Test Satellite     F00002   Vanguard                TV-3              CC       LC18A                    F     Vang-ER9948                   0  Vanguard         CC            1957              1"

Taller hierarchies

Next example: change point problem / segmentation

drawing

Change point problem

Data looks like this:

drawing

Can you build a Bayesian model for that?
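As a possible starting point, here is a minimal JAGS-style sketch of a single change point model, assuming the data are counts \(y_1, \dots, y_T\) whose Poisson rate switches once at an unknown time; all names here (y, tau, lambda, rate) are illustrative, not taken from the slides:

model {
  # Rates before and after the change point (vague Gamma priors)
  lambda[1] ~ dgamma(0.001, 0.001)
  lambda[2] ~ dgamma(0.001, 0.001)

  # Uniform prior on the change point location
  for (t in 1:length(y)) {
    unifPrior[t] <- 1 / length(y)
  }
  tau ~ dcat(unifPrior)

  for (t in 1:length(y)) {
    # step(tau - t) is 1 up to and including the change point, 0 afterwards
    rate[t] <- lambda[1] * step(tau - t) + lambda[2] * (1 - step(tau - t))
    y[t] ~ dpois(rate[t])
  }
}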

How to pick distributions?

Frequently used distributions

By type:

By interpretation:

Mathematical model

JAGS under the hood

Key technique used by JAGS: slice sampling.
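To make the idea concrete, here is a minimal univariate slice sampler in R, in the style of the stepping-out and shrinkage procedures of Neal (2003). This is a sketch for intuition, not JAGS's actual implementation, and the names (slice_sample_once, log_f, w) are illustrative:

slice_sample_once <- function(x0, log_f, w = 1) {
  # 1. Auxiliary variable: a height drawn uniformly under the density at x0
  #    (on the log scale: log u = log f(x0) + log Unif(0, 1))
  log_u <- log_f(x0) + log(runif(1))
  # 2. "Stepping out": grow an interval [L, R] around x0 until both
  #    endpoints fall below the slice (assumes the slice is bounded)
  L <- x0 - runif(1) * w
  R <- L + w
  while (log_f(L) > log_u) L <- L - w
  while (log_f(R) > log_u) R <- R + w
  # 3. "Shrinkage": sample uniformly in [L, R]; on rejection, shrink
  #    the interval towards x0 and try again
  repeat {
    x1 <- runif(1, L, R)
    if (log_f(x1) > log_u) return(x1)
    if (x1 < x0) L <- x1 else R <- x1
  }
}

# Example: 1000 slice sampling moves targeting a standard normal
x <- 0
xs <- numeric(1000)
for (j in 1:1000) { x <- slice_sample_once(x, function(z) dnorm(z, log = TRUE)); xs[j] <- x }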

Why should we look under the hood? To understand the failure modes and how to address them.

Things that the slice sampler has problems with (failure modes):

Solutions to these problems:

First failure mode: unidentifiability

We have done two thought experiments so far:

drawing

model Unidentifiable {
  ...
  laws {
    p ~ ContinuousUniform(0, 1)
    p2 ~ ContinuousUniform(0, 1)
    nFails | p, p2, nLaunches ~ Binomial(nLaunches, p * p2)
  }
}
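The source of the trouble: the likelihood depends on \(p\) and \(p_2\) only through their product,

\[L(p, p_2) = \binom{\text{nLaunches}}{\text{nFails}} (p \, p_2)^{\text{nFails}} (1 - p \, p_2)^{\text{nLaunches} - \text{nFails}},\]

so it is constant along the curves \(p \, p_2 = \text{constant}\); for example, the data cannot distinguish \((p, p_2) = (0.5, 1)\) from \((1, 0.5)\).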

drawing

drawing

MCMC: failure modes

Unidentifiability creates computational difficulties.

Example:

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
require(rjags)
require(coda)
require(ggplot2)

modelstring="
  model {
  
    p1 ~ dunif(0, 1)
    p2 ~ dunif(0, 1)
  
    x ~ dbin(p1 * p2, n)
  }
"

# Make the simulations reproducible
# 1 is arbitrary, but the rest of the simulation is 
# deterministic given that initial seed.
plots <- list()
i <- 1
for (n in c(10,10000)) {local({
  my.seed <- 1 
  set.seed(my.seed) 
  inits <- list(.RNG.name = "base::Mersenne-Twister",
                .RNG.seed = my.seed)
  
  model <- jags.model(
    textConnection(modelstring), 
    data = list( # Pass on the data:
      'n' = n,
      'x' = n/2), 
    inits=inits) 
  
  samples <- 
    coda.samples(model,
                 c('p1', 'p2'), # These are the variables we want to monitor (plot, etc)
                 100) # number of MCMC iterations

  
  chain <- samples[[1]]
  p1 <- chain[,1]
  p2 <- chain[,2]
  df <- data.frame("p1" = as.numeric(p1), "p2" = as.numeric(p2))
  
  p <- ggplot(df, aes(p1, p2)) +
    xlim(0, 1) +
    ylim(0, 1) +
    geom_path()
  plots[[i]] <<- p

  })
  i <- i + 1
}
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 2
##    Total graph size: 9
## 
## Initializing model
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 2
##    Total graph size: 9
## 
## Initializing model
multiplot(plotlist = plots)
## Loading required package: grid

Second failure mode: “multimodality”

N <- 100000

components <- sample(1:3,prob=c(0.3,0.5,0.2),size=N,replace=TRUE)
mus <- c(0,10,3)
sds <- sqrt(c(1,1,0.1))

samples <- rnorm(n=N,mean=mus[components],sd=sds[components])
data <- data.frame(samples)
ggplot(data, aes(samples)) + geom_density() + theme_bw()

require(rjags)
require(ggplot2)
require(coda)
require(ggmcmc) # Note: nice package converting JAGS output into tidy format + ggplot-based trace/posterior histograms/etc
## Loading required package: ggmcmc
modelstring="
  model {
    for (k in 1:2) {
      mu[k] ~ dnorm(0, .0001)
      sigma[k] ~ dunif(0, 100)
      alpha[k] <- 1
    }
    pi ~ ddirch(alpha)
    
    for (n in 1:length(observation)) {
      z[n] ~ dcat(pi)
      # NB: JAGS parameterizes dnorm by precision (not standard deviation),
      # so sigma[k] here plays the role of a precision
      observation[n] ~ dnorm(mu[z[n]], sigma[z[n]])
    }
  }
"

data <- read.csv( file = "mixture_data_with_header.csv", header = TRUE)

# Make the simulations reproducible
my.seed <- 1 # 1 is arbitrary. The rest of the simulation is deterministic given that initial seed.
set.seed(my.seed) 
inits <- list(.RNG.name = "base::Mersenne-Twister",
              .RNG.seed = my.seed)

model <- jags.model(
  textConnection(modelstring), 
  data = data, 
  inits = inits) 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 300
##    Unobserved stochastic nodes: 305
##    Total graph size: 1216
## 
## Initializing model
samples <- 
  coda.samples(model,
               c('mu'), # These are the variables we want to monitor (plot, compute expectations, etc)
               10000) # number of MCMC iterations

tidy <- ggs(samples)
ggs_traceplot(tidy) + ylim(100,210) + theme_bw()

drawing

Prerequisite to understand slice sampling and MH

Question: how to design a sampler for a uniform distribution on a set \(A\), assuming only pointwise evaluation?

Recall: pointwise evaluation means that all you can do is, given \(x\), answer whether \(x \in A\).

Example: \(A = \{0, 1, 2\}\)

Idea: use a random walk, so that the approach scales to complicated/high-dimensional \(A\).

Idea 1 (incorrect)

Move to one neighbor at random.

compute_stationary_distribution <- function(M) {
  # eigenvector of t(M) for eigenvalue 1, normalized to sum to one
  eig <- eigen(t(M))
  eigenvalues <- eig$values
  i <- which.max(Re(eigenvalues[abs(Im(eigenvalues)) < 1e-6]))
  statio <- Re(eig$vectors[,i])
  return(statio / sum(statio))
}

M = matrix(c(
  0,   1, 0,
  1/2, 0, 1/2,
  0,   1, 0
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))

Idea 2: propose + accept reject

M = matrix(c(
  1/2,   1/2, 0,
  1/2,   0,   1/2,
  0,     1/2, 1/2
), nrow=3, ncol=3, byrow=T)

barplot(compute_stationary_distribution(M))

Understanding why idea/walk 1 fails while idea/walk 2 works
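A quick check (a calculation filled in here, not copied from the slides): for walk 1, solving the stationarity equation \(\pi M = \pi\) gives \(\pi = (1/4, 1/2, 1/4)\), so the middle state is visited twice as often as the end states and the walk does not target the uniform distribution on \(A\). For walk 2, the transition matrix is doubly stochastic (its columns also sum to one), and for a doubly stochastic matrix the uniform vector satisfies \(\pi M = \pi\), so the uniform distribution on \(A\) is stationary.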

Occupancy vector and transition matrix

Stationarity

drawing
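In symbols (this is the fixed point equation referred to in the note below): writing \(\pi\) for the occupancy (row) vector and \(M\) for the transition matrix, stationarity means

\[\pi M = \pi \quad\Longleftrightarrow\quad M^\top \pi^\top = 1 \cdot \pi^\top,\]

i.e. \(\pi^\top\) is an eigenvector of \(M^\top\) with eigenvalue \(1\).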

Note: the fixed point equation is the same as the equation defining eigenvectors and eigenvalues (up to transpose). This explains how I implemented the function compute_stationary_distribution in the earlier R code:

print(compute_stationary_distribution)
## function(M) {
##   # eigenvector of t(M) for eigenvalue 1, normalized to sum to one
##   eig <- eigen(t(M))
##   eigenvalues <- eig$values
##   i <- which.max(Re(eigenvalues[abs(Im(eigenvalues)) < 1e-6]))
##   statio <- Re(eig$vectors[,i])
##   return(statio / sum(statio))
## }