Normalization constant estimation via stepping stone
Alexandre Bouchard-Côté
Normalization constant
Goal: computing
\[Z = \int \gamma(x) {\text{d}}x\]
Main motivation in Bayesian statistics: comparing models via Bayes factors, i.e. ratios of marginal likelihoods.
Recall from T8-model-selection-basics.html:
- Let \(\gamma(x) = {\text{prior}}(x) {\text{likelihood}}(x)\)
- Computing \(Z\) for a first model.
- Computing \(Z'\) for a second model.
- Taking a ratio \(Z/Z'\) gives us the Bayes factor.
Note that, in contrast to, say, computing a posterior mean, a marginal likelihood is not an expectation, in the sense that \(\gamma(x)\) is not a probability distribution in general. This is why it requires its own set of computational methods.
Background
Suppose
\[\gamma(x) = |\sin(x) e^{-x^2}| {{\bf 1}}[x > 0]\]
We want to approximate
\[Z = \int \gamma(x) {\text{d}}x\]
Poll: which of these converge to \(Z\) (a.s.)?
- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}|/e^{-X_i}\) for \(X_i \sim \text{Exp}(1)\)
- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}|/(2 e^{-2 X_i})\) for \(X_i \sim \text{Exp}(2)\)
- \((1/n) \sum_i |\sin(X_i) e^{-X_i^2}| {{\bf 1}}[X_i > 0] \sqrt{2\pi}/e^{-X_i^2/2}\) for \(X_i \sim \text{Normal}(0, 1)\)
- All of the above
- None of the above
Background: importance sampling
We want to approximate
\[Z = \int \gamma(x) {\text{d}}x\]
All of the choices above share the following structure: sample \(X_i \sim q\), and compute
\[
\begin{align*}
Z = \int \gamma(x) {\text{d}}x &= \int \gamma(x) \frac{q(x)}{q(x)} {\text{d}}x \\
&= \int \frac{\gamma(x)}{q(x)} q(x) {\text{d}}x \\
&= {\mathbf{E}}\left[ \frac{\gamma(X_1)}{q(X_1)} \right] \leftarrow \frac{1}{n} \sum_{i=1}^n \frac{\gamma(X_i)}{q(X_i)}
\end{align*}
\]
This is just importance sampling!
Note that for this to work, \(q\) must be properly normalized.
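As a quick sanity check, the identity above can be verified numerically on the running example \(\gamma(x) = |\sin(x) e^{-x^2}| {{\bf 1}}[x > 0]\), using the \(\text{Exp}(1)\) proposal from the poll (a minimal sketch, not part of the original notes):

```python
import math
import random

random.seed(1)

def gamma(x):
    # unnormalized target: |sin(x) exp(-x^2)| on x > 0
    return abs(math.sin(x) * math.exp(-x ** 2)) if x > 0 else 0.0

def q_exp1(x):
    # Exp(1) density -- properly normalized, as required
    return math.exp(-x)

n = 200_000
z_hat = sum(gamma(x) / q_exp1(x)
            for x in (random.expovariate(1.0) for _ in range(n))) / n
print(z_hat)  # numerical integration gives Z roughly 0.43
```

The \(\text{Exp}(2)\) and \(\text{Normal}(0, 1)\) proposals from the poll give consistent estimators of the same \(Z\), just with different variances.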
Next
- First, a naive way to approximate normalization constants using importance sampling
- Why it is broken
- How to fix it
Naive method to compute normalization constants with importance sampling
We want to approximate
\[Z = \int \gamma(x) {\text{d}}x\]
- In the Bayesian context \(\gamma(x) = {\text{prior}}(x) {\text{likelihood}}(x)\)
- First idea: use \(q(x) = {\text{prior}}(x)\), get the estimator:
\[
\frac{1}{n} \sum_{i=1}^n \frac{\gamma(X_i)}{q(X_i)} = \frac{1}{n} \sum_{i=1}^n \frac{{\text{prior}}(X_i) {\text{likelihood}}(X_i)}{{\text{prior}}(X_i)} = \frac{1}{n} \sum_{i=1}^n {\text{likelihood}}(X_i)
\]
- This idea works very poorly in practice!
- Reason: “a sample of size approximately \(\exp(\text{KL}(\gamma\|q))\) is necessary and sufficient for accurate estimation by importance sampling” see https://arxiv.org/abs/1511.01437
- \(\text{KL}(\gamma\|q)\) is the KL divergence
- A similar (also bad) idea is to use the posterior distribution \(\pi\) as \(q\); this yields the so-called “harmonic mean estimator”, also known as the “Worst Monte Carlo Method Ever”
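To see the failure mode concretely, consider a toy conjugate model (an assumption for illustration, not part of the notes): \(\mu \sim \text{Normal}(0, 1)\), \(y_j \mid \mu \sim \text{Normal}(\mu, 1)\), whose marginal likelihood is available in closed form. The sketch below compares the naive prior-as-proposal estimator against the exact value as the number of observations \(d\) grows:

```python
import math
import random

random.seed(2)

def exact_log_z(ys):
    # closed-form log marginal likelihood of the conjugate toy model
    d, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
    return (-0.5 * d * math.log(2 * math.pi) - 0.5 * math.log(d + 1)
            - 0.5 * (ss - s * s / (d + 1)))

def log_lik(mu, ys):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in ys)

def naive_log_z(ys, n=20_000):
    # q = prior, so the importance weights are just the likelihoods;
    # averaged in log space via log-sum-exp for numerical stability
    lw = [log_lik(random.gauss(0.0, 1.0), ys) for _ in range(n)]
    m = max(lw)
    return m + math.log(sum(math.exp(l - m) for l in lw) / n)

for d in (1, 10, 100):
    ys = [random.gauss(1.0, 1.0) for _ in range(d)]
    print(d, exact_log_z(ys), naive_log_z(ys))
```

As \(d\) grows the posterior concentrates away from the prior, \(\text{KL}(\pi\|q)\) grows, and the required sample size \(\exp(\text{KL})\) quickly exceeds any feasible \(n\): a handful of prior draws end up carrying essentially all of the weight.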
Stepping stone: a better way to use importance sampling for normalization constants
- Suppose we have not one joint distribution but instead a sequence \(\gamma_0, \gamma_1, \gamma_2, \dots, \gamma_N\)
- Let \(Z_i = \int \gamma_i(x) {\text{d}}x\)
- Suppose \(Z_0\) is known (e.g. for prior distributions, the normalization constant is often known)
- We are interested in \(Z_N\)
- \(\pi_i = \gamma_i / Z_i\)
- Example: \(\pi_i\) obtained by annealing between the prior (\(\pi_0\)) and posterior (\(\pi_N\)), as in the parallel tempering topic
- Insight: in many practical problems, the value of \(N\) needed for \(\text{KL}(\pi_i\|\pi_{i+1}) = O(1)\) grows slowly (probably polynomially?)
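For concreteness, the annealing path from the parallel tempering topic can be written as (e.g. with equally spaced annealing parameters, one common choice among several):

\[\gamma_i(x) = {\text{prior}}(x)\, {\text{likelihood}}(x)^{\beta_i}, \qquad \beta_i = i/N,\]

so that \(\gamma_0\) is the prior (hence \(Z_0 = 1\) when the prior is normalized) and \(\gamma_N = {\text{prior}} \times {\text{likelihood}}\) is the unnormalized posterior.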
Stepping stone: some details
Write:
\[Z = Z_N = \frac{Z_N}{Z_{N-1}} \frac{Z_{N-1}}{Z_{N-2}} \dots \frac{Z_{1}}{Z_{0}} Z_0\]
- Each ratio will be estimated using an importance sampling estimator
- The last factor, \(Z_0\), is assumed to be known.
Estimating \(Z_{i+1}/Z_i\) based on importance sampling:
- We have samples for \(\pi_{i+1}\) and \(\pi_i\)
- Let us use \(q = \pi_i\) as the proposal (we want proposals with tails no thinner than the target's; \(\pi_i\) is closer to the prior, hence will often have heavier tails), and take \(\gamma = \gamma_{i+1}\) as the unnormalized target
- Twist: we cannot evaluate \(q\), only an unnormalized version, \(q = \tilde q / C\).
\[
\begin{align*}
\frac{Z_{i+1}}{Z_i} = \frac{1}{C} \int \gamma(x) {\text{d}}x &= \frac{1}{C} \int \gamma(x) \frac{\tilde q(x)}{\tilde q(x)} {\text{d}}x \\
&= \frac{1}{C} \int \frac{\gamma(x)}{\tilde q(x)} \tilde q(x) {\text{d}}x \\
&= \int \frac{\gamma(x)}{\tilde q(x)} q(x) {\text{d}}x \\
&= {\mathbf{E}}\left[ \frac{\gamma(X_1)}{\tilde q(X_1)} \right] \leftarrow \frac{1}{n} \sum_{j=1}^n \frac{\gamma(X_j)}{\tilde q(X_j)}
\end{align*}
\]
For more information, see: https://pubmed.ncbi.nlm.nih.gov/21187451/
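Putting the pieces together, here is a sketch of the full stepping stone estimator on the same toy conjugate Normal model as above (the model and all names are illustrative assumptions; conjugacy lets us sample each \(\pi_i\) exactly here, whereas in practice those samples would come from MCMC, e.g. parallel tempering):

```python
import math
import random

random.seed(4)

def exact_log_z(ys):
    # closed-form log marginal likelihood of the conjugate Normal toy model
    d, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
    return (-0.5 * d * math.log(2 * math.pi) - 0.5 * math.log(d + 1)
            - 0.5 * (ss - s * s / (d + 1)))

def make_log_lik(ys):
    # O(1) log-likelihood evaluation via sufficient statistics
    d, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
    c = -0.5 * d * math.log(2 * math.pi)
    return lambda mu: c - 0.5 * (ss - 2 * s * mu + d * mu * mu)

def sample_tempered_posterior(beta, ys):
    # pi_beta proportional to prior * likelihood^beta is still Normal, by conjugacy
    d, s = len(ys), sum(ys)
    prec = beta * d + 1.0
    return random.gauss(beta * s / prec, 1.0 / math.sqrt(prec))

def stepping_stone_log_z(ys, num_steps=30, n=2_000):
    log_lik = make_log_lik(ys)
    # cubic schedule: more stones near the prior, where successive
    # distributions differ the most
    betas = [(i / num_steps) ** 3 for i in range(num_steps + 1)]
    log_z = 0.0  # log Z_0 = 0: the prior is normalized
    for b_cur, b_next in zip(betas, betas[1:]):
        # log of the importance sampling estimate of Z_{i+1}/Z_i: weights
        # are likelihood^(beta_{i+1} - beta_i) under pi_i (log-sum-exp)
        lw = [(b_next - b_cur) * log_lik(sample_tempered_posterior(b_cur, ys))
              for _ in range(n)]
        m = max(lw)
        log_z += m + math.log(sum(math.exp(l - m) for l in lw) / n)
    return log_z

ys = [random.gauss(1.0, 1.0) for _ in range(100)]
print(stepping_stone_log_z(ys), exact_log_z(ys))  # the two should be close
```

Note that with \(d = 100\) observations, the naive prior-as-proposal estimator would be hopeless, yet the product of per-step ratios recovers the exact answer because each step only bridges a small KL gap.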
This is the default method used in Blang; its output can be found in Plots > monitoringPlots > logNormalizationContantProgress.pdf, as we have seen earlier in the course, see e.g. Model_selection.html