Point estimates, confidence estimates, and the Bayes estimator

Alexandre Bouchard-Côté

Overview

Point estimation: when you have to make a single best guess
Set estimation: when you want to convey how much confidence you have about this best guess

First example

You repeatedly put your finger on a globe of the Earth, at a point chosen uniformly at random
Each time, you record if you “landed” on water (W) or land (L)
The goal is to estimate the proportion \(p\) of Earth covered by water

Based on: example in textbook Statistical Rethinking

Bayes estimator

Recall:

\[\color{blue}{{\textrm{argmin}}} \{ \color{red}{{\mathbf{E}}}[\color{blue}{L}(a, Z) \color{green}{| X}] : a \in {\mathcal{A}}\}\]

encodes a 3-step approach applicable to almost any statistical problem:

\(\color{red}{\text{Construct a probability model}}\)
\(\color{green}{\text{Compute or approximate the posterior distribution conditionally on the actual data at hand}}\)
\(\color{blue}{\text{Solve an optimization problem to turn the posterior distribution into an action}}\)

For the globe water/land example: steps 1 and 2 are the same as in the Delta Rocket example

\(Z\) is \(p\)
\(X\) is the list of W / L outcomes, encoded as 1 for W and 0 for L

Model

Joint distribution \(\gamma(p, x)\) is specified via chain rule as:
- A uniform prior on the unknown proportion \(p\):
  - \(p \sim {\text{Unif}}(0, 1)\)
- Each time you put your finger on the globe, the outcome is a Bernoulli draw with parameter \(p\); given \(p\), the draws are independent and identically distributed:
  - \(x_i | p \sim {\text{Bern}}(p)\)
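The chain-rule specification above can be read as a recipe for simulating from the joint distribution. The following is a minimal sketch using only the Python standard library; the function name `simulate_globe_tosses`, the seed, and the number of tosses are illustrative choices, not part of the original example.

```python
import random

def simulate_globe_tosses(n_tosses, rng):
    """Draw (p, x) from the joint: p ~ Unif(0, 1), then x_i | p ~ Bern(p)."""
    p = rng.random()  # prior draw for the proportion of water
    # Each toss is 1 (water) with probability p, else 0 (land).
    x = [1 if rng.random() < p else 0 for _ in range(n_tosses)]
    return p, x

rng = random.Random(1)  # arbitrary seed, only for reproducibility
p, x = simulate_globe_tosses(9, rng)
print("simulated p:", p)
print("simulated tosses:", x)
```

In the actual inference problem, only `x` is observed and `p` is the unknown to be estimated.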

Point estimate

The goal is to estimate the proportion \(p\) of Earth covered by water
Often, you need to provide one numerical best guess
How to do this optimally?

Point estimate from Bayes estimator

Select a loss function
Solve the optimization problem specified by the Bayes estimator

\[\delta^*(X) = \color{blue}{{\textrm{argmin}}} \{ \color{red}{{\mathbf{E}}}[\color{blue}{L}(a, Z) \color{green}{| X}] : a \in {\mathcal{A}}\}\]

Step 3: \(\color{blue}{\text{Solve an optimization problem to turn the posterior distribution into an action}}\)

Example: square loss \({\mathcal{A}}= {\mathbf{R}}\), \(L(a, p) = (a - p)^2\)

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[(Z - a)^2 | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[Z^2 | X] - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \end{aligned} \]

Poll: \(\delta^*(x)\) can be simplified to…

\(\int z p(z|x) {\text{d}}z\)
\({\textrm{argmax}}\{ p(z|x) : z \in {\mathbf{R}}\}\)
\(\int x p(z|x) {\text{d}}x\)
\({\textrm{argmax}}\{ p(z|x) : x \in {\mathbf{R}}\}\)
None of the above

Point estimate from Bayes estimator

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[(Z - a)^2 | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbf{E}}[Z^2 | X] - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ - 2a{\mathbf{E}}[Z | X] + a^2 : a \in {\mathcal{A}}\} \end{aligned} \]

Idea: think of \({\mathbf{E}}[Z|X]\) as a constant that you get from the posterior. To minimize the bottom expression, take the derivative with respect to \(a\) and set it to zero:

\[ \begin{aligned} -2 {\mathbf{E}}[Z|X] + 2a = 0 \end{aligned} \]

Hence, the Bayes estimator here is the posterior mean, \(\delta^*(X) = {\mathbf{E}}[Z|X] = \int z p(z|X) {\text{d}}z\).
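As a numerical check of this result, the sketch below approximates the posterior on a grid and compares the posterior mean with the action that minimizes the expected square loss over the same grid. The data (6 W and 3 L), the grid size, and the helper names are hypothetical choices made for illustration. Under the uniform prior the posterior is a Beta distribution with parameters (number of W + 1, number of L + 1), so for this data the exact posterior mean is 7/11.

```python
def grid_posterior(x, n_grid=1001):
    """Grid approximation of the posterior of p under a Unif(0, 1) prior."""
    n_water = sum(x)
    n_land = len(x) - n_water
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    # The prior is constant, so the unnormalized posterior is the likelihood.
    weights = [z ** n_water * (1 - z) ** n_land for z in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

def expected_square_loss(a, grid, probs):
    """Approximate E[(a - Z)^2 | X] as a sum over the grid."""
    return sum(pr * (a - z) ** 2 for z, pr in zip(grid, probs))

x = [1, 0, 1, 1, 1, 0, 1, 0, 1]  # hypothetical data: 6 W, 3 L
grid, probs = grid_posterior(x)

posterior_mean = sum(z * pr for z, pr in zip(grid, probs))
# Brute-force search for the best action among the grid points.
best_action = min(grid, key=lambda a: expected_square_loss(a, grid, probs))

print("posterior mean:", posterior_mean)
print("minimizer of expected square loss on the grid:", best_action)
```

The two printed values should agree up to the grid spacing, since the brute-force minimizer can only land on a grid point.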

Set estimate

The weakness of point estimates is that they do not capture the uncertainty around the value
Idea: instead of returning a single point, return a set of points
- usually an interval,
- but this can be generalized
Bayesian terminology: credible interval (\(\neq\) frequentist confidence intervals)
Goals:
- We would like the credible interval to contain a fixed fraction of the posterior mass (e.g. 95%)
- At the same time, we would like this credible interval to be as short as possible given that posterior mass constraint
Bayes estimator formalization:
- \({\mathcal{A}}= \{[c, d] : c < d\}\),
- consider the loss function given by \[ L([c, d], z) = {{\bf 1}}\{z \notin [c, d]\} + k (d - c) \] for some tuning parameter \(k\) to be determined later.

We get:

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z \notin [c, d] | X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z < c|X] + {\mathbb{P}}[Z > d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \\ &= {\textrm{argmin}}\{ {\mathbb{P}}[Z \le c|X] - {\mathbb{P}}[Z \le d |X] + k(d - c) : [c,d] \in {\mathcal{A}}\} \end{aligned} \]

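Here we assume the posterior has a continuous density \(f\), which lets us change \(<\) into \(\le\); the last step also uses \({\mathbb{P}}[Z > d|X] = 1 - {\mathbb{P}}[Z \le d|X]\) and drops the constant 1, which does not affect the argmin. As before, we take the derivative with respect to \(c\) and set it to zero, then do the same for \(d\). Notice that \({\mathbb{P}}[Z \le c|X]\) is the posterior CDF, so taking the derivative with respect to \(c\) yields the posterior density: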

\[ f_{Z|X}(c) - k = 0, \]

so we see the optimum will be the smallest interval \([c, d]\) such that \(f(c) = f(d) = k\).
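For completeness, here is the corresponding step for \(d\), written out as a sketch under the same continuous-density assumption. Differentiating the objective with respect to \(d\) gives

\[ \frac{\partial}{\partial d} \Big( {\mathbb{P}}[Z \le c|X] - {\mathbb{P}}[Z \le d|X] + k(d - c) \Big) = -f_{Z|X}(d) + k = 0. \]

If the density is also differentiable, the second derivatives of the objective are \(f'_{Z|X}(c)\) in \(c\) and \(-f'_{Z|X}(d)\) in \(d\), with zero cross term, so a minimum needs the density to be increasing at \(c\) and decreasing at \(d\). For a unimodal posterior this means the interval is \(\{z : f_{Z|X}(z) \ge k\}\).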

Finally, choose \(k\) so that the resulting interval captures, say, 95% of the posterior mass.
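The recipe above can be sketched numerically on a grid: keep the grid points with the highest posterior density until they hold 95% of the mass, then read off the endpoints and the implied threshold \(k\). This is a minimal illustration that assumes a unimodal posterior, so that the kept points form a single interval; the counts (6 W, 3 L), the grid size, and the function names are hypothetical.

```python
def grid_posterior(n_water, n_land, n_grid=1001):
    """Grid approximation of the posterior of p under a Unif(0, 1) prior."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    weights = [z ** n_water * (1 - z) ** n_land for z in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

def highest_density_interval(grid, probs, mass=0.95):
    """Shortest grid interval holding at least `mass` (unimodal case only)."""
    # Visit grid points from highest to lowest posterior probability.
    order = sorted(range(len(grid)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= mass:
            break
    spacing = grid[1] - grid[0]
    # Density at the last kept point approximates k in f(c) = f(d) = k.
    k = probs[kept[-1]] / spacing
    return grid[min(kept)], grid[max(kept)], k

grid, probs = grid_posterior(n_water=6, n_land=3)  # hypothetical counts
c, d, k = highest_density_interval(grid, probs)
covered = sum(pr for z, pr in zip(grid, probs) if c <= z <= d)
print(f"interval [{c:.3f}, {d:.3f}] holds {covered:.3f} of the mass, k is about {k:.3f}")
```

Searching over the mass directly, rather than over \(k\), is a design choice made for convenience here: each value of \(k\) corresponds to one interval, so fixing the mass at 95% determines \(k\) implicitly.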