Alexandre Bouchard-Côté

**Goal** computing

\[Z = \int \gamma(x) {\text{d}}x\]

**Main motivation in Bayesian statistics** comparing models via Bayes factors, i.e. ratios of marginal likelihoods.

Recall from T8-model-selection-basics.html:

- Let \(\gamma(x) = {\text{prior}}(x) {\text{likelihood}}(x)\)
- Computing \(Z\) for a first model.
- Computing \(Z'\) for a second model.
- Taking a ratio \(Z/Z'\) gives us the Bayes factor.

**Note** in contrast to, say, computing a posterior mean, a marginal likelihood is not an expectation, in the sense that \(\gamma(x)\) is not a probability distribution in general (it need not integrate to one). This is why it requires its own set of computational methods.

Suppose

\[\gamma(x) = |\sin(x) e^{-x^2}| {{\bf 1}}[x > 0]\]

We want to approximate

\[Z = \int \gamma(x) {\text{d}}x\]

- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}|/e^{-X_i}\) for \(X_i \sim \text{Exp}(1)\)
- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}|/(2 e^{-2 X_i})\) for \(X_i \sim \text{Exp}(2)\)
- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}| {{\bf 1}}[X_i > 0] \sqrt{2\pi}/e^{-X_i^2/2}\) for \(X_i \sim \text{Normal}(0, 1)\)
- All of the above
- None of the above

We want to approximate

\[Z = \int \gamma(x) {\text{d}}x\]

All of the above choices share the same structure: \(X_i \sim q\), and

\[ \begin{align*} Z = \int \gamma(x) {\text{d}}x &= \int \gamma(x) \frac{q(x)}{q(x)} {\text{d}}x \\ &= \int \frac{\gamma(x)}{q(x)} q(x) {\text{d}}x \\ &= {\mathbf{E}}\left[ \frac{\gamma(X_1)}{q(X_1)} \right] \leftarrow \frac{1}{n} \sum_{i=1}^n \frac{\gamma(X_i)}{q(X_i)} \end{align*} \]

**This is just importance sampling!**
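To make this concrete, here is a small numerical sketch (assuming NumPy; the function and variable names are ours, for illustration only) estimating \(Z\) for the running example with two of the proposals from the quiz:

```python
import numpy as np

# Unnormalized target from the running example: gamma(x) = |sin(x) exp(-x^2)| 1[x > 0]
def gamma(x):
    return np.abs(np.sin(x) * np.exp(-x**2)) * (x > 0)

rng = np.random.default_rng(0)
n = 200_000

# Proposal 1: Exp(1), with density q(x) = exp(-x) on x > 0
x1 = rng.exponential(scale=1.0, size=n)
est_exp1 = np.mean(gamma(x1) / np.exp(-x1))

# Proposal 2: Exp(2), with density q(x) = 2 exp(-2x) on x > 0
x2 = rng.exponential(scale=0.5, size=n)
est_exp2 = np.mean(gamma(x2) / (2.0 * np.exp(-2.0 * x2)))

print(est_exp1, est_exp2)  # both approximate the same Z
```

Both estimators target the same integral; they differ only in their variance, which depends on how well the proposal matches the target.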

**Note** for this to work, \(q\) is assumed to be properly normalized.

**Next**

- First, a naive way to approximate normalization constants using importance sampling
- Why it is broken
- How to fix it

We want to approximate

\[Z = \int \gamma(x) {\text{d}}x\]

- In the Bayesian context \(\gamma(x) = {\text{prior}}(x) {\text{likelihood}}(x)\)
- First idea: use \(q(x) = {\text{prior}}(x)\), get the estimator:

\[ \frac{1}{n} \sum_{i=1}^n \frac{\gamma(X_i)}{q(X_i)} = \frac{1}{n} \sum_{i=1}^n \frac{{\text{prior}}(X_i) {\text{likelihood}}(X_i)}{{\text{prior}}(X_i)} = \frac{1}{n} \sum_{i=1}^n {\text{likelihood}}(X_i) \]

- This idea works very poorly in practice!
- Reason: “a sample of size approximately \(\exp(\text{KL}(\gamma\|q))\) is necessary and sufficient for accurate estimation by importance sampling” see https://arxiv.org/abs/1511.01437
- \(\text{KL}(\gamma\|q)\) is the KL divergence

- A similar idea (also bad) is to use the posterior distribution \(\pi\) for \(q\). This yields the so-called “harmonic mean estimator” (which averages reciprocal likelihoods over posterior samples), also known as the “Worst Monte Carlo Method Ever”: its importance weights typically have infinite variance.
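A quick way to see the failure mode of the prior-as-proposal idea is to look at the effective sample size (ESS) of the importance weights. The sketch below (assuming NumPy; the toy model and all names are ours, chosen for illustration) uses a vague \(\text{Normal}(0, 10^2)\) prior with 50 observations, so the posterior is far more concentrated than the prior and most prior samples carry essentially zero weight:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 observations from Normal(2, 1)
data = rng.normal(2.0, 1.0, size=50)
m, s1, s2 = len(data), data.sum(), (data**2).sum()

# Prior Normal(0, 10^2) used as the importance sampling proposal
n = 100_000
theta = rng.normal(0.0, 10.0, size=n)

# Importance weights are likelihood(theta_i); work in log space,
# using sum_i (y_i - theta)^2 = s2 - 2 theta s1 + m theta^2
log_w = -0.5 * m * np.log(2 * np.pi) - 0.5 * (s2 - 2 * theta * s1 + m * theta**2)
w = np.exp(log_w - log_w.max())  # subtract max for numerical stability

# Effective sample size: (sum w)^2 / sum w^2
ess = w.sum()**2 / (w**2).sum()
print(ess / n)  # a tiny fraction of the nominal sample size
```

Only the few prior samples that land near the posterior mode contribute, so the nominal sample size vastly overstates the information in the estimate.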

- Suppose we have not one joint distribution but instead a sequence \(\gamma_0, \gamma_1, \gamma_2, \dots, \gamma_N\)
- Let \(Z_i = \int \gamma_i(x) {\text{d}}x\)
- Suppose \(Z_0\) is known (e.g. for prior distributions, normalization constants are often known)
- We are interested in \(Z_N\)
- \(\pi_i = \gamma_i / Z_i\)
- Example: \(\pi_i\) obtained by annealing between the prior (\(\pi_0\)) and posterior (\(\pi_N\)), as in the parallel tempering topic

- Insight: in many practical problems, the value of \(N\) needed for \(\text{KL}(\pi_i\|\pi_{i+1}) = O(1)\) grows slowly (probably polynomially?)

Write:

\[Z = Z_N = \frac{Z_N}{Z_{N-1}} \frac{Z_{N-1}}{Z_{N-2}} \dots \frac{Z_{1}}{Z_{0}} Z_0\]

- Each ratio will be estimated using an importance sampling estimator
- The last one, \(Z_0\), is assumed to be known.

Estimating \(Z_{i+1}/Z_i\) based on importance sampling:

- We have samples from \(\pi_i\) and \(\pi_{i+1}\)
- Let us use \(q = \pi_i\) as the proposal (we want the proposal's tails to be no thinner than the target's; \(\pi_i\) is closer to the prior, hence will often have heavier tails), and take \(\gamma = \gamma_{i+1}\) as the unnormalized target
- Twist: we cannot evaluate \(q\), only an unnormalized version \(\tilde q = \gamma_i\), i.e. \(q = \tilde q / C\) with \(C = Z_i\).

\[ \begin{align*} \frac{Z_{i+1}}{Z_i} = \frac{1}{C} \int \gamma(x) {\text{d}}x &= \frac{1}{C} \int \gamma(x) \frac{\tilde q(x)}{\tilde q(x)} {\text{d}}x \\ &= \frac{1}{C} \int \frac{\gamma(x)}{\tilde q(x)} \tilde q(x) {\text{d}}x \\ &= \int \frac{\gamma(x)}{\tilde q(x)} q(x) {\text{d}}x \\ &= {\mathbf{E}}\left[ \frac{\gamma(X_1)}{\tilde q(X_1)} \right] \leftarrow \frac{1}{n} \sum_{j=1}^n \frac{\gamma(X_j)}{\tilde q(X_j)} \end{align*} \]
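Putting the pieces together, here is a sketch of the full scheme (assuming NumPy; the toy model, step count, and all names are ours) on a conjugate model where the answer is known in closed form: prior \(\text{Normal}(0,1)\), a single observation \(y = 3\) with likelihood \(\text{Normal}(y \mid \theta, 1)\), and an annealed path \(\gamma_\beta(\theta) = {\text{prior}}(\theta)\,{\text{likelihood}}(\theta)^\beta\). Each tempered distribution is Gaussian here, so we can sample it exactly; in practice these samples would come from MCMC, e.g. the parallel tempering chains:

```python
import numpy as np

rng = np.random.default_rng(1)
y = 3.0                      # single observation
n_steps, n_samp = 20, 5_000  # annealing steps and samples per step
betas = np.linspace(0.0, 1.0, n_steps + 1)

def log_lik(theta):
    # log Normal(y | theta, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta)**2

# gamma_beta(theta) = Normal(theta | 0, 1) * Normal(y | theta, 1)^beta is
# Gaussian with precision 1 + beta and mean beta * y / (1 + beta),
# so in this toy model we can draw exact samples from each pi_beta.
log_Z_hat = 0.0  # log Z_0 = 0: the prior is normalized
for b, b_next in zip(betas[:-1], betas[1:]):
    prec = 1.0 + b
    theta = rng.normal(b * y / prec, np.sqrt(1.0 / prec), size=n_samp)
    # Z_{i+1}/Z_i = E_{pi_i}[ gamma_{i+1}/gamma_i ] = E_{pi_i}[ likelihood^(b_next - b) ]
    log_Z_hat += np.log(np.mean(np.exp((b_next - b) * log_lik(theta))))

# Exact marginal likelihood for this conjugate model: y ~ Normal(0, 1 + 1)
log_Z_exact = -0.5 * np.log(2 * np.pi * 2.0) - y**2 / (2 * 2.0)
print(log_Z_hat, log_Z_exact)
```

Note how each ratio only ever evaluates \(\gamma_{i+1}/\gamma_i\): the unknown constants \(Z_i\) never need to be computed individually, since they cancel in the telescoping product.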

For more information, see: https://pubmed.ncbi.nlm.nih.gov/21187451/

This is the default method used in Blang; its output can be found in Plots > monitoringPlots > logNormalizationContantProgress.pdf, as we have seen earlier in the course (see e.g. Model_selection.html).