Bayes estimators

Recall: the Bayes estimator,

\[\color{blue}{{\textrm{argmin}}} \{ \color{red}{{\mathbf{E}}}[\color{blue}{L}(a, Z) \color{green}{| X}] : a \in {\mathcal{A}}\}\]

encodes a 3-step approach applicable to almost any statistical problem:

\(\color{red}{\text{Construct a probability model}}\)
\(\color{green}{\text{Compute or approximate the posterior distribution conditionally on the actual data at hand}}\)
\(\color{blue}{\text{Solve an optimization problem to turn the posterior distribution into an "action"}}\)

Estimators

We want to devise a decision-making strategy, which we formalize as an estimator:
- a function that takes as input only the observations, \(x\), and outputs a proposed action, \(\delta(x) \in {\mathcal{A}}\).
- i.e., \(\delta : {\mathscr{X}}\to {\mathcal{A}}\)
We want this estimator to be as “good” as possible.
- Under a certain criterion of goodness, we will see that the Bayesian framework provides a principled and systematic way of specifying a “best” estimator.

Evaluation of estimators

Frequentist risk: view \(\theta = z\) as a parameter for a likelihood, indexing a family of probabilities over observables \(\{{\mathbb{P}}_z\}\), with a corresponding collection of expectation operators \(\{{\mathbf{E}}_z\}\), \[ \begin{aligned} R(z, \delta) &= {\mathbf{E}}_z[L(\delta(X), z)] \\ &= \int L(\delta(x), z)\ \text{likelihood}(x | z) {\text{d}}x \end{aligned} \]
Bayesian notion: integrated risk \[ \begin{aligned} r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \\ &= \int \int L(\delta(x), z)\ \text{prior}(z)\ \text{likelihood}(x | z)\ {\text{d}}x {\text{d}}z \end{aligned} \]

Key difference:

Frequentist risk: a partial order on estimators
- The only canonical notion of optimality is then non-dominance, called admissibility
Bayes risk: a complete order on estimators (under weak conditions)
- We can actually get an expression for the optimal estimator
- As a bonus, it also satisfies the frequentist notion of non-sub-optimality: Bayes estimators are admissible under weak conditions (more on this later)
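The contrast between the two notions of risk can be checked numerically. The sketch below is only an illustration under assumptions not in the notes: a binomial model \(X \sim \text{Bin}(n, z)\) with \(n = 10\), a uniform prior on \(z\), squared loss, and two hypothetical estimators named `mle` and `post_mean`. Neither estimator has lower frequentist risk at every \(z\), whereas the integrated risk assigns each estimator a single number and so ranks them.

```python
import math

def binom_pmf(x, n, z):
    # likelihood(x | z) for the assumed binomial model
    return math.comb(n, x) * z**x * (1 - z) ** (n - x)

def loss(a, z):
    # squared loss (an assumption of this sketch)
    return (a - z) ** 2

def frequentist_risk(delta, z, n):
    # R(z, delta) = E_z[L(delta(X), z)], an exact finite sum over x
    return sum(loss(delta(x, n), z) * binom_pmf(x, n, z) for x in range(n + 1))

def integrated_risk(delta, n, grid_size=2000):
    # r(delta) = integral of R(z, delta) against a uniform prior on [0, 1],
    # approximated with the midpoint rule
    zs = [(k + 0.5) / grid_size for k in range(grid_size)]
    return sum(frequentist_risk(delta, z, n) for z in zs) / grid_size

def mle(x, n):
    return x / n

def post_mean(x, n):
    # posterior mean under the uniform prior
    return (x + 1) / (n + 2)

n = 10
for z in (0.02, 0.5):
    print(z, frequentist_risk(mle, z, n), frequentist_risk(post_mean, z, n))
print(integrated_risk(mle, n), integrated_risk(post_mean, n))
```

In this toy setting `mle` has the lower frequentist risk near the boundary and `post_mean` has the lower risk near \(z = 0.5\), so the frequentist risk alone does not rank them.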

The Bayes estimator

So far: abstract definition of Bayes estimators as minimizers of the integrated risk \[ \begin{aligned} \delta^* &= {\textrm{argmin}}_{\delta : {\mathscr{X}}\to {\mathcal{A}}} \{ r(\delta) \} \\ r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \end{aligned} \]

More explicit expression: The estimator \(\delta^*\), defined by the equation below, minimizes the integrated risk

\[ \delta^*(X) = {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \]
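A short sketch of why this estimator minimizes the integrated risk, assuming the expectations are finite and the minimum over \({\mathcal{A}}\) is attained: by the law of total expectation, for any estimator \(\delta\),

\[ \begin{aligned} r(\delta) &= {\mathbf{E}}[L(\delta(X), Z)] \\ &= {\mathbf{E}}\big[ {\mathbf{E}}[L(\delta(X), Z) | X] \big] \\ &\ge {\mathbf{E}}\big[ \min_{a \in {\mathcal{A}}} {\mathbf{E}}[L(a, Z) | X] \big] \\ &= {\mathbf{E}}\big[ {\mathbf{E}}[L(\delta^*(X), Z) | X] \big] = r(\delta^*) \end{aligned} \]

The inequality holds for each value of \(X\) because \(\delta(X)\) is one particular action in \({\mathcal{A}}\), so its posterior expected loss cannot be smaller than the minimum over all actions.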

This estimator \(\delta^*\) is called a Bayes estimator.

This means that given a model and a goal, the Bayesian framework provides in principle a recipe for constructing an estimator.
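As an illustration of this recipe, here is a minimal sketch on a toy discrete problem. Everything in it (the states, observations, actions, probabilities and loss values) is hypothetical and chosen only so that the three steps (model, posterior, optimization) are each visible as a few lines of code.

```python
# Step 1: construct a probability model (prior and likelihood), all values hypothetical.
prior = {"healthy": 0.9, "sick": 0.1}
likelihood = {  # P(X = x | Z = z)
    "healthy": {"negative": 0.95, "positive": 0.05},
    "sick": {"negative": 0.20, "positive": 0.80},
}

# Loss L(a, z): waiting when the patient is sick is assumed to be very costly.
loss = {
    ("treat", "healthy"): 1.0,
    ("treat", "sick"): 0.0,
    ("wait", "healthy"): 0.0,
    ("wait", "sick"): 20.0,
}
actions = ["treat", "wait"]

def posterior(x):
    # Step 2: posterior over z given the observed x, via Bayes rule.
    joint = {z: prior[z] * likelihood[z][x] for z in prior}
    total = sum(joint.values())
    return {z: p / total for z, p in joint.items()}

def posterior_expected_loss(a, x):
    # E[L(a, Z) | X = x]
    post = posterior(x)
    return sum(loss[(a, z)] * post[z] for z in post)

def bayes_estimator(x):
    # Step 3: minimize the posterior expected loss over the action set.
    return min(actions, key=lambda a: posterior_expected_loss(a, x))

for x in ("negative", "positive"):
    print(x, posterior(x), bayes_estimator(x))
```

Because the action set is finite and tiny here, the optimization step is an exhaustive search; in realistic models both the posterior and the optimization typically need approximation.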

However, the computation required to implement this recipe may be considerable. This explains why computational statistics plays a large role in Bayesian statistics and in this course.

Black box optimization

Approximate the objective function using \(M\) Monte Carlo samples \(Z_1, \dots, Z_M\) from the posterior:

\[ \begin{aligned} \delta^*(X) &= {\textrm{argmin}}\{ {\mathbf{E}}[L(a, Z) | X] : a \in {\mathcal{A}}\} \\ &\approx {\textrm{argmin}}\{ \frac{1}{M} \sum_{i=1}^M L(a, Z_i) : a \in {\mathcal{A}}\} \\ \end{aligned} \]
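The approximation above can be sketched in a few lines. The setup is an assumption made for illustration: the posterior is taken to be a Beta(3, 9) distribution that we can sample from directly, the loss is the absolute error, and the action set is a grid on \([0, 1]\). The names `mc_objective` and `a_star` are hypothetical.

```python
import random
import statistics

random.seed(0)

# Z_1, ..., Z_M: samples standing in for draws from the posterior
# (here an assumed Beta(3, 9); in practice these would come from e.g. MCMC).
M = 2000
samples = [random.betavariate(3, 9) for _ in range(M)]

def loss(a, z):
    # absolute-error loss (an assumption of this sketch)
    return abs(a - z)

def mc_objective(a, samples):
    # (1/M) * sum_i L(a, Z_i), the Monte Carlo estimate of E[L(a, Z) | X]
    return sum(loss(a, z) for z in samples) / len(samples)

# Black-box optimization by exhaustive search over a grid of candidate actions.
actions = [k / 200 for k in range(201)]
a_star = min(actions, key=lambda a: mc_objective(a, samples))

print(a_star, statistics.median(samples))
```

For the absolute-error loss the minimizer of the Monte Carlo objective is a sample median, so the grid search should land within one grid step of `statistics.median(samples)`. The optimizer only ever calls `mc_objective`, which is what makes the approach black-box with respect to the loss.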

Idea that could be part of a project: stochastic gradient meets Bayes estimators
If \({\mathcal{A}}\) is tricky to explore (combinatorial, or constrained as in the motivating tracking problem, etc.) and \({\mathcal{A}}= \{z \in {\mathscr{Z}}\}\), we can further approximate both the objective and the constraints as follows:
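One possible reading, offered here as an assumption and not as the notes' own construction, is to restrict the candidate actions to the posterior samples themselves, i.e. to search over \(a \in \{Z_1, \dots, Z_M\}\). Since each sample lies in \({\mathscr{Z}}= {\mathcal{A}}\), the constraints are satisfied automatically. The sketch below uses a hypothetical constraint (points on the unit circle), squared Euclidean loss, and samples drawn from an assumed von Mises distribution over angles.

```python
import math
import random

random.seed(1)

# Posterior samples constrained to the unit circle (hypothetical example):
# angles drawn from an assumed von Mises distribution, mapped to 2D points.
M = 300
angles = [random.vonmisesvariate(0.5, 4.0) for _ in range(M)]
samples = [(math.cos(t), math.sin(t)) for t in angles]

def loss(a, z):
    # squared Euclidean distance (an assumption of this sketch)
    return (a[0] - z[0]) ** 2 + (a[1] - z[1]) ** 2

def mc_objective(a):
    # (1/M) * sum_i L(a, Z_i)
    return sum(loss(a, z) for z in samples) / M

# Search only over the samples: every candidate satisfies the constraint.
a_star = min(samples, key=mc_objective)

# For comparison, the unconstrained minimizer of this loss is the sample mean,
# which lies strictly inside the circle and so violates the constraint.
mean = (sum(z[0] for z in samples) / M, sum(z[1] for z in samples) / M)

print(a_star, math.hypot(*a_star))
print(mean, math.hypot(*mean))
```

The search costs \(O(M^2)\) loss evaluations, which is one reason to consider subsampling or stochastic-gradient variants for large \(M\).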

Idea that could be part of a project: Bayes estimator for feature matrices

Bayes estimators from a frequentist perspective

Recall: admissibility, a frequentist notion of optimality (or rather, non-sub-optimality).

An estimator \(\delta\) is admissible if there is no dominating estimator \(\delta'\).
Domination is here understood under the frequentist risk \(R(z, \delta) = {\mathbf{E}}_z[L(\delta(X), z)]\),
i.e. \(\delta\) is admissible if there is no \(\delta'\) such that \(R(z, \delta') \le R(z, \delta)\) for all \(z\), with strict inequality \(R(z, \delta') < R(z, \delta)\) for at least one \(z\).

Proposition: if a Bayes estimator is unique, it is admissible.
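A proof sketch, glossing over measure-zero technicalities: suppose some estimator \(\delta'\) dominated the Bayes estimator \(\delta^*\), so that in particular \(R(z, \delta') \le R(z, \delta^*)\) for all \(z\). Integrating both sides against the prior gives

\[ r(\delta') = \int R(z, \delta')\ \text{prior}(z)\ {\text{d}}z \le \int R(z, \delta^*)\ \text{prior}(z)\ {\text{d}}z = r(\delta^*) \]

so \(\delta'\) also minimizes the integrated risk and is therefore a Bayes estimator. By uniqueness, \(\delta' = \delta^*\), which means the two risk functions coincide and \(\delta'\) cannot be strictly better anywhere, a contradiction.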

To show uniqueness, one may for example try to use convexity of the loss function.