Normalization constant estimation via stepping stone
Alexandre Bouchard-Côté
Normalization constant
Goal: computing
\[Z = \int \gamma(x) {\text{d}}x\]
Main motivation in Bayesian statistics: comparing models via Bayes factors, i.e. ratios of marginal likelihoods.
Recall from T8-model-selection-basics.html:
- Let \(\gamma(x) = {\text{prior}}(x) {\text{likelihood}}(x)\)
- Computing \(Z\) for a first model.
- Computing \(Z'\) for a second model.
- Taking a ratio \(Z/Z'\) gives us the Bayes factor.
Note that, in contrast to, say, computing a posterior mean, a marginal likelihood is not an expectation, in the sense that \(\gamma(x)\) is not a probability distribution in general. This is why it requires its own set of computational methods.
Background
Suppose
\[\gamma(x) = |\sin(x) e^{-x^2}| {{\bf 1}}[x > 0]\]
We want to approximate
\[Z = \int \gamma(x) {\text{d}}x\]
Poll: which of these converge to \(Z\) (a.s.)?
- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}|/e^{-X_i}\) for \(X_i \sim \text{Exp}(1)\)
- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}|/(2 e^{-2 X_i})\) for \(X_i \sim \text{Exp}(2)\)
- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}| {{\bf 1}}[X_i > 0] \sqrt{2\pi}/e^{-X_i^2/2}\) for \(X_i \sim \text{Normal}(0, 1)\)
- All of the above
- None of the above
Background: importance sampling
We want to approximate
\[Z = \int \gamma(x) {\text{d}}x\]
All of the choices above share the following structure: sample \(X_i \sim q\), and compute
\[
\begin{align*}
Z = \int \gamma(x) {\text{d}}x &= \int \gamma(x) \frac{q(x)}{q(x)} {\text{d}}x \\
&= \int \frac{\gamma(x)}{q(x)} q(x) {\text{d}}x \\
&= {\mathbf{E}}\left[ \frac{\gamma(X_1)}{q(X_1)} \right] \leftarrow \frac{1}{n} \sum_{i=1}^n \frac{\gamma(X_i)}{q(X_i)}
\end{align*}
\]
This is just importance sampling!
Note that for this to work, \(q\) must be properly normalized.
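As a quick sanity check, the identity above can be verified numerically on the running example \(\gamma(x) = |\sin(x) e^{-x^2}| {{\bf 1}}[x > 0]\), using the \(\text{Exp}(1)\) proposal from the poll (a minimal sketch, not part of the original notes):

```python
import math
import random

random.seed(1)

def gamma(x):
    # unnormalized target: |sin(x) exp(-x^2)| on x > 0
    return abs(math.sin(x) * math.exp(-x ** 2)) if x > 0 else 0.0

def q_exp1(x):
    # Exp(1) density -- properly normalized, as required
    return math.exp(-x)

n = 200_000
z_hat = sum(gamma(x) / q_exp1(x)
            for x in (random.expovariate(1.0) for _ in range(n))) / n
print(z_hat)  # numerical integration gives Z roughly 0.43
```

The \(\text{Exp}(2)\) and \(\text{Normal}(0, 1)\) proposals from the poll give consistent estimators of the same \(Z\), just with different variances.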
Next
- First, a naive way to approximate normalization constants using importance sampling
- Why it is broken
- How to fix it
Naive method to compute normalization constants with importance sampling
We want to approximate
\[Z = \int \gamma(x) {\text{d}}x\]
- In the Bayesian context \(\gamma(x) = {\text{prior}}(x) {\text{likelihood}}(x)\)
- First idea: use \(q(x) = {\text{prior}}(x)\), get the estimator:
\[
\frac{1}{n} \sum_{i=1}^n \frac{\gamma(X_i)}{q(X_i)} = \frac{1}{n} \sum_{i=1}^n \frac{{\text{prior}}(X_i) {\text{likelihood}}(X_i)}{{\text{prior}}(X_i)} = \frac{1}{n} \sum_{i=1}^n {\text{likelihood}}(X_i)
\]
- This idea works very poorly in practice!
- Reason: “a sample of size approximately \(\exp(\text{KL}(\gamma\|q))\) is necessary and sufficient for accurate estimation by importance sampling” see https://arxiv.org/abs/1511.01437
- \(\text{KL}(\gamma\|q)\) is the KL divergence
- A similar (also bad) idea is to use the posterior distribution \(\pi\) as \(q\); this yields the so-called “harmonic mean estimator”, also known as the “Worst Monte Carlo Method Ever”
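To see the failure mode concretely, consider a toy conjugate model (an assumption for illustration, not part of the notes): \(\mu \sim \text{Normal}(0, 1)\), \(y_j \mid \mu \sim \text{Normal}(\mu, 1)\), whose marginal likelihood is available in closed form. The sketch below compares the naive prior-as-proposal estimator against the exact value as the number of observations \(d\) grows:

```python
import math
import random

random.seed(2)

def exact_log_z(ys):
    # closed-form log marginal likelihood of the conjugate toy model
    d, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
    return (-0.5 * d * math.log(2 * math.pi) - 0.5 * math.log(d + 1)
            - 0.5 * (ss - s * s / (d + 1)))

def log_lik(mu, ys):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in ys)

def naive_log_z(ys, n=20_000):
    # q = prior, so the importance weights are just the likelihoods;
    # averaged in log space via log-sum-exp for numerical stability
    lw = [log_lik(random.gauss(0.0, 1.0), ys) for _ in range(n)]
    m = max(lw)
    return m + math.log(sum(math.exp(l - m) for l in lw) / n)

for d in (1, 10, 100):
    ys = [random.gauss(1.0, 1.0) for _ in range(d)]
    print(d, exact_log_z(ys), naive_log_z(ys))
```

As \(d\) grows the posterior concentrates away from the prior, \(\text{KL}(\pi\|q)\) grows, and the required sample size \(\exp(\text{KL})\) quickly exceeds any feasible \(n\): a handful of prior draws end up carrying essentially all of the weight.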
Stepping stone: a better way to use importance sampling for normalization constants
- Suppose we have not one joint distribution but instead a sequence \(\gamma_0, \gamma_1, \gamma_2, \dots, \gamma_N\)
- Let \(Z_i = \int \gamma_i(x) {\text{d}}x\)
- Suppose \(Z_0\) is known (e.g. for prior distributions, the normalization constant is often known)
- We are interested in \(Z_N\)
- \(\pi_i = \gamma_i / Z_i\)
- Example: \(\pi_i\) obtained by annealing between the prior (\(\pi_0\)) and posterior (\(\pi_N\)), as in the parallel tempering topic
- Insight: in many practical problems, the value of \(N\) needed for \(\text{KL}(\pi_i\|\pi_{i+1}) = O(1)\) grows slowly (probably polynomially?)
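For concreteness, the annealing path from the parallel tempering topic can be written as (e.g. with equally spaced annealing parameters, one common choice among several):

\[\gamma_i(x) = {\text{prior}}(x)\, {\text{likelihood}}(x)^{\beta_i}, \qquad \beta_i = i/N,\]

so that \(\gamma_0\) is the prior (hence \(Z_0 = 1\) when the prior is normalized) and \(\gamma_N = {\text{prior}} \times {\text{likelihood}}\) is the unnormalized posterior.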
Stepping stone: some details
Write:
\[Z = Z_N = \frac{Z_N}{Z_{N-1}} \frac{Z_{N-1}}{Z_{N-2}} \dots \frac{Z_{1}}{Z_{0}} Z_0\]
- Each ratio will be estimated using an importance sampling estimator
- The last factor, \(Z_0\), is assumed to be known.
Estimating \(Z_{i+1}/Z_i\) based on importance sampling:
- We have samples for \(\pi_{i+1}\) and \(\pi_i\)
- Let us use \(q = \pi_i\) as the proposal (we want proposals with tails no thinner than the target's; \(\pi_i\) is closer to the prior, hence will often have heavier tails), and take \(\gamma = \gamma_{i+1}\) as the unnormalized target
- Twist: we cannot evaluate \(q\), only an unnormalized version, \(q = \tilde q / C\).
\[
\begin{align*}
\frac{Z_{i+1}}{Z_i} = \frac{1}{C} \int \gamma(x) {\text{d}}x &= \frac{1}{C} \int \gamma(x) \frac{\tilde q(x)}{\tilde q(x)} {\text{d}}x \\
&= \frac{1}{C} \int \frac{\gamma(x)}{\tilde q(x)} \tilde q(x) {\text{d}}x \\
&= \int \frac{\gamma(x)}{\tilde q(x)} q(x) {\text{d}}x \\
&= {\mathbf{E}}\left[ \frac{\gamma(X_1)}{\tilde q(X_1)} \right] \leftarrow \frac{1}{n} \sum_{j=1}^n \frac{\gamma(X_j)}{\tilde q(X_j)}
\end{align*}
\]
For more information, see: https://pubmed.ncbi.nlm.nih.gov/21187451/
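Putting the pieces together, here is a sketch of the full stepping stone estimator on the same toy conjugate Normal model as above (the model and all names are illustrative assumptions; conjugacy lets us sample each \(\pi_i\) exactly here, whereas in practice those samples would come from MCMC, e.g. parallel tempering):

```python
import math
import random

random.seed(4)

def exact_log_z(ys):
    # closed-form log marginal likelihood of the conjugate Normal toy model
    d, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
    return (-0.5 * d * math.log(2 * math.pi) - 0.5 * math.log(d + 1)
            - 0.5 * (ss - s * s / (d + 1)))

def make_log_lik(ys):
    # O(1) log-likelihood evaluation via sufficient statistics
    d, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
    c = -0.5 * d * math.log(2 * math.pi)
    return lambda mu: c - 0.5 * (ss - 2 * s * mu + d * mu * mu)

def sample_tempered_posterior(beta, ys):
    # pi_beta proportional to prior * likelihood^beta is still Normal, by conjugacy
    d, s = len(ys), sum(ys)
    prec = beta * d + 1.0
    return random.gauss(beta * s / prec, 1.0 / math.sqrt(prec))

def stepping_stone_log_z(ys, num_steps=30, n=2_000):
    log_lik = make_log_lik(ys)
    # cubic schedule: more stones near the prior, where successive
    # distributions differ the most
    betas = [(i / num_steps) ** 3 for i in range(num_steps + 1)]
    log_z = 0.0  # log Z_0 = 0: the prior is normalized
    for b_cur, b_next in zip(betas, betas[1:]):
        # log of the importance sampling estimate of Z_{i+1}/Z_i: weights
        # are likelihood^(beta_{i+1} - beta_i) under pi_i (log-sum-exp)
        lw = [(b_next - b_cur) * log_lik(sample_tempered_posterior(b_cur, ys))
              for _ in range(n)]
        m = max(lw)
        log_z += m + math.log(sum(math.exp(l - m) for l in lw) / n)
    return log_z

ys = [random.gauss(1.0, 1.0) for _ in range(100)]
print(stepping_stone_log_z(ys), exact_log_z(ys))  # the two should be close
```

Note that with \(d = 100\) observations, the naive prior-as-proposal estimator would be hopeless, yet the product of per-step ratios recovers the exact answer because each step only bridges a small KL gap.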
This is the default method used in Blang; its output can be found in Plots > monitoringPlots > logNormalizationContantProgress.pdf, as we have seen earlier in the course, see e.g. Model_selection.html