

Maximum A Posteriori Approximation

In most applications the (usually very high or even formally infinite dimensional) ${h}$-integral over model states in Eq. (23) cannot be performed exactly. The two most common methods used to calculate the ${h}$ integral approximately are Monte Carlo integration [155,91,95,198,16,70,199,21,219,238,69,170,180,202,171] and saddle point approximation [16,45,30,172,17,249,201,69,76,132]. The latter approach will be studied in the following.
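
As a minimal one-dimensional illustration of the saddle point idea, the following sketch compares the Laplace approximation of an integral of the form $\int\!dh\, e^{-E(h)}$ with direct numerical quadrature; the toy energy $E(h)$ and all numerical settings are arbitrary illustrative choices, not quantities appearing in the text.
\begin{verbatim}
# Minimal 1D sketch (toy energy, not the functional integral of the text):
# compare Z = int dh exp(-E(h)) computed by quadrature with its saddle point
# (Laplace) approximation exp(-E(h*)) * sqrt(2*pi / E''(h*)).
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def E(h):
    # illustrative single-minimum, non-quadratic energy
    return 0.5 * (h - 1.0) ** 2 + 0.1 * h ** 4

Z_exact, _ = quad(lambda h: np.exp(-E(h)), -np.inf, np.inf)

h_star = minimize_scalar(E).x                     # location of the minimum
eps = 1e-5
E2 = (E(h_star + eps) - 2 * E(h_star) + E(h_star - eps)) / eps**2   # E''(h*)
Z_laplace = np.exp(-E(h_star)) * np.sqrt(2 * np.pi / E2)

print(f"quadrature: {Z_exact:.6f}   Laplace: {Z_laplace:.6f}")
\end{verbatim}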

For that purpose, we expand $E_{\rm comb}$ of Eq. (24) with respect to ${h}$ around some ${h}^{*}$

$\displaystyle E_{\rm comb} ({h})$ $\textstyle =$ $\displaystyle e^{(\Delta {h},\nabla) }E_{\rm comb}({h})\Big\vert _{{h}={h}^{*}}$ (66)
  $\textstyle =$ $\displaystyle E_{\rm comb} ({h}^{*}) + ( \Delta {h} , \nabla ({h}^{*}) ) + \frac{1}{2}( \Delta {h}, {\bf H}({h}^{*}) \Delta {h}) + \cdots$  

with $\Delta {h}$ = $( {h}-{h}^{*})$, gradient $\nabla$ (not acting on $\Delta {h}$), Hessian ${\bf H}$, and round brackets $( \cdots , \cdots)$ denoting scalar products. In case $p(y\vert x,{h})$ is parameterized independently for every $x$, $y$, the states ${h}$ represent a parameter set indexed by $x$ and $y$; hence
\begin{displaymath}
\nabla (x,y;{h}^{*})
= \frac{\delta E_{\rm comb}({h})}
{\delta {h}(x,y)}\Bigg\vert _{{h}={h}^{*}}
= \frac{\delta E_{\rm comb}({h})}
{\delta p(y\vert x,{h})}\Bigg\vert _{{h}={h}^{*}}
,
\end{displaymath} (67)


\begin{displaymath}
{\bf H}(x,y,x^\prime,y^\prime;{h}^{*})
= \frac{\delta^2 E_{\rm comb}({h})}
{\delta p(y\vert x,{h})\, \delta p(y^\prime\vert x^\prime,{h})}
\Bigg\vert _{{h}={h}^{*}}
,
\end{displaymath} (68)

are functional derivatives [97,106,29,36] (or partial derivatives for discrete $x$, $y$) and for example
\begin{displaymath}
(\Delta {h}, \nabla ({h}^{*}) )
= \int\!dx\,dy\,
\left( {h}(x,y)-{h}^{*}(x,y) \right) \nabla ({h}^{*})(x,y)
.
\end{displaymath} (69)
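
To make the expansion (66)-(69) concrete, the following sketch checks the second order Taylor approximation for a discretized ${h}$ (a finite-dimensional vector), with gradient and Hessian obtained by finite differences; the toy energy and all numerical settings are arbitrary illustrative choices.
\begin{verbatim}
# Sketch of the expansion (66)-(69) for a discretized h (a finite vector):
# compare E_comb(h* + dh) with its second order Taylor approximation built
# from a numerical gradient and Hessian. E_comb below is a toy energy.
import numpy as np

def E_comb(h):
    return 0.5 * np.sum(h**2) + 0.1 * np.sum(h**4) + np.sum(np.diff(h)**2)

d = 5
h_star = np.zeros(d)                       # minimum of the toy energy
dh = 0.1 * np.ones(d)                      # small displacement Delta h
eps = 1e-5

grad = np.array([(E_comb(h_star + eps*e) - E_comb(h_star - eps*e)) / (2*eps)
                 for e in np.eye(d)])
hess = np.array([[(E_comb(h_star + eps*ei + eps*ej)
                   - E_comb(h_star + eps*ei - eps*ej)
                   - E_comb(h_star - eps*ei + eps*ej)
                   + E_comb(h_star - eps*ei - eps*ej)) / (4*eps**2)
                  for ej in np.eye(d)] for ei in np.eye(d)])

taylor = E_comb(h_star) + grad @ dh + 0.5 * dh @ hess @ dh
print(E_comb(h_star + dh), taylor)         # close for small dh
\end{verbatim}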

Choosing ${h}^{*}$ to be the location of a local minimum of $E_{\rm comb}({h})$, the linear term in (66) vanishes. The second order term includes the Hessian and corresponds to a Gaussian integral over ${h}$ which can be solved analytically,

\begin{displaymath}
\int\!d{h}\, e^{-\frac{1}{2}(\Delta {h}, {\bf H}\Delta {h})}
=
(2\pi)^\frac{d}{2}
(\det {\bf H})^{-\frac{1}{2}}
,\end{displaymath} (70)

for a $d$-dimensional ${h}$-integral. However, using the same approximation for the ${h}$-integrals in numerator and denominator of Eq. (23), then also expanding $p(y\vert x,{h})$ around ${h}^{*}$ and keeping only the first (${h}$-independent) term $p(y\vert x,{h}^{*})$ of that expansion, the factor (70) cancels, even for infinite $d$. (The result is the zero order term of an expansion of the predictive density in powers of $1/\beta$. Higher order contributions can be calculated by using Wick's theorem [45,30,172,249,109,163,132].) The final approximate result for the predictive density (27) is very simple and intuitive,
\begin{displaymath}
p(y\vert x,f) \approx p(y\vert x,{h}^{*})
,
\end{displaymath} (71)

with
\begin{displaymath}
{h}^{*}
= {\rm argmin}_{{h}\in {H}} E_{\rm comb}({h})
= {\rm argmax}_{{h}\in {H}} \,p({h}\vert f)
= {\rm argmax}_{{h}\in {H}} \,p(y_D\vert x_D,{h})\,p({h}\vert D_0)
.
\end{displaymath} (72)

The saddle point (or Laplace) approximation is therefore also called Maximum A Posteriori Approximation (MAP). Notice that the same $h^*$ also maximizes the integrand of the evidence of the data $y_D$
\begin{displaymath}
p(y_D\vert x_D,D_0)
=
\int dh\, p(y_D\vert x_D,h) p(h\vert D_0)
.
\end{displaymath} (73)

This is due to the assumption that $p(y\vert x,{h})$ is slowly varying at the stationary point and therefore does not have to be included in the saddle point approximation for the predictive density. For (functionally) differentiable $E_{\rm comb}$, Eq. (72) yields the stationarity equation,
\begin{displaymath}
\frac{\delta E_{\rm comb}({h})}{\delta {h}(x,y)} = 0
.
\end{displaymath} (74)

The functional $E_{\rm comb}$, including training and prior data (regularization, stabilizer) terms, is also known as the (regularized) error functional for ${h}$.
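
A discretized sketch in the spirit of Eqs. (72) and (74) might look as follows: a regularized error functional consisting of a Gaussian data term and a smoothness prior energy (discretized squared second derivative) is minimized numerically to obtain ${h}^{*}$; the grid, $\sigma$, and the regularization weight are hypothetical illustrative choices.
\begin{verbatim}
# Hypothetical discretized sketch: minimize a regularized error functional
# (Gaussian data term + smoothness prior energy) to obtain the MAP solution.
# grid, sigma and lambda_ are illustrative choices, not fixed by the text.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 50)                 # discretization of x
dx = x_grid[1] - x_grid[0]
x_data = rng.uniform(0.0, 1.0, 20)
y_data = np.sin(2 * np.pi * x_data) + 0.3 * rng.normal(size=20)
idx = np.searchsorted(x_grid, x_data).clip(0, len(x_grid) - 1)
sigma, lambda_ = 0.3, 1e-3

def E_comb(h):
    # training (data) term: negative Gaussian log-likelihood up to constants
    data_term = 0.5 / sigma**2 * np.sum((y_data - h[idx])**2)
    # prior term: discretized squared second derivative as smoothness measure
    prior_term = 0.5 * lambda_ * np.sum(np.diff(h, 2)**2) / dx**3
    return data_term + prior_term

h_star = minimize(E_comb, np.zeros_like(x_grid), method="L-BFGS-B").x
# h_star plays the role of the MAP solution; per Eq. (71) the predictive
# density is then approximated by p(y|x,h*).
\end{verbatim}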

In practice a saddle point approximation may be expected to be useful if the posterior is peaked enough around a single maximum or, more generally, if the posterior is well approximated by a Gaussian centered at the maximum. For asymptotic results one would have to require

\begin{displaymath}
-\frac{1}{\beta} \sum_i L(y_i\vert x_i,{h})
,
\end{displaymath} (75)

to become $\beta$-independent for $\beta\rightarrow \infty$, with the same $\beta$ appearing in the prior and the data term (see [40,242]). If, for example, $\frac{1}{n} \sum_i L(y_i\vert x_i,{h})$ converges for a large number $n$ of training data, the low temperature limit $1/\beta\rightarrow 0$ can be interpreted as the large data limit $n\rightarrow\infty$,
\begin{displaymath}
n E_{\rm comb} =
n \left( -\frac{1}{n} \sum_i L(y_i\vert x_i,{h})
+ \frac{1}{n} E({h}\vert D_0) \right)
.
\end{displaymath} (76)

Notice, however, the factor $1/n$ in front of the prior energy. For Gaussian $p(y\vert x,{h})$ the temperature $1/\beta$ corresponds to the variance $\sigma^2$,
\begin{displaymath}
\frac{1}{\sigma^2}
E_{\rm comb} =
\frac{1}{\sigma^2}
\left( \frac{1}{2} \sum_i (y_i-{h}(x_i))^2
+ \sigma^2
E({h}\vert D_0) \right)
.
\end{displaymath} (77)

For a Gaussian prior this would require a simultaneous scaling of the data and prior variances.
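
For instance, assuming the standard Gaussian form $p(y_i\vert x_i,{h}) = (2\pi\sigma^2)^{-1/2}\exp\left(-(y_i-{h}(x_i))^2/(2\sigma^2)\right)$ with constant variance $\sigma^2$, the data term of the energy becomes
\begin{displaymath}
-\sum_i \ln p(y_i\vert x_i,{h})
= \frac{1}{2\sigma^2} \sum_i \left(y_i-{h}(x_i)\right)^2 + {\rm const.}
,
\end{displaymath}
so that the inverse variance $1/\sigma^2$ multiplies the data term in the same way as $\beta$ (respectively $n$) does above; this is the identification underlying Eq. (77).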

We should also remark that for continuous $x$, $y$ the stationary solution ${h}^{*}$ need not be a typical representative of the process $p({h}\vert f)$. A common example is a Gaussian stochastic process $p({h}\vert f)$ with prior energy $E({h}\vert D_0)$ related to some smoothness measure of ${h}$ expressed by derivatives of $p(y\vert x,{h})$. Then, even if the stationary ${h}^{*}$ is smooth, this need not be the case for a typical ${h}$ sampled according to $p({h}\vert f)$. For Brownian motion, for instance, a typical sample path is not even differentiable (though continuous), while the stationary path is smooth. Thus, for continuous variables, only expressions like $\int \!d{h}\,e^{-\beta E({h})}$ can be given an exact meaning as a Gaussian measure, defined by a given covariance with existing normalization factor, but not the expressions $d{h}$ and $E({h})$ alone [51,62,228,110,83,144].
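
The Brownian motion remark can be illustrated numerically: the difference quotients of a typical sample path grow like $1/\sqrt{dt}$ as the time step shrinks, so no derivative exists in the limit, whereas the mean path (here identically zero) is perfectly smooth. The following sketch uses arbitrary numerical settings.
\begin{verbatim}
# Sketch of the remark above: for Brownian motion the difference quotients
# of a typical sample path grow like 1/sqrt(dt) (no derivative in the limit),
# while the mean ("stationary") path is identically zero and smooth.
import numpy as np

rng = np.random.default_rng(0)
for dt in (1e-2, 1e-4, 1e-6):
    n = int(1.0 / dt)
    increments = rng.normal(scale=np.sqrt(dt), size=n)   # Wiener increments
    path = np.cumsum(increments)                         # typical sample path
    diff_quot = np.abs(np.diff(path)) / dt               # |h(t+dt)-h(t)| / dt
    print(f"dt={dt:.0e}  mean |difference quotient| ~ {diff_quot.mean():.1f}")
# the printed values grow roughly like 1/sqrt(dt): the typical path is
# continuous but not differentiable, unlike the smooth stationary path h*.
\end{verbatim}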

Interestingly, the stationary ${h}^{*}$ yielding maximal posterior $p({h}\vert f)$ is not only useful for obtaining an approximation to the predictive density $p(y\vert x,f)$ but is also the optimal solution $a^*$ of a Bayesian decision problem with log-loss and $a\in A={H}$.

Indeed, for a Bayesian decision problem with log-loss (46)

\begin{displaymath}
{\rm argmin}_{a\in {H}} r(a,{h}) = {h}
,
\end{displaymath} (78)

and analogously,
\begin{displaymath}
{\rm argmin}_{a\in F} r(a,f) = f
.
\end{displaymath} (79)


This is proved as follows: Jensen's inequality states that

\begin{displaymath}
\int\!dy\, p(y)g(q(y))
\ge
g( \int\!dy\, p(y) q(y) )
,
\end{displaymath} (80)

for any convex function $g$ and probability density $p(y)\ge 0$ with $\int\!dy\, p(y) =1$. Thus, because the logarithm is concave, so that $-\ln$ is convex,
\begin{displaymath}
-\int\!dy\, p(y\vert x,{h}) \ln\frac{p(y\vert x,a)}{p(y\vert x,{h})}
\ge
-\ln \int\!dy\, p(y\vert x,{h}) \frac{p(y\vert x,a)}{p(y\vert x,{h})}
=0
\end{displaymath} (81)


\begin{displaymath}
\Rightarrow
-\int\!dy\, p(y\vert x,{h}) \ln p(y\vert x,a)
\ge
-\int\!dy\, p(y\vert x,{h}) \ln p(y\vert x,{h})
,
\end{displaymath} (82)

with equality for $a={h}$. Hence
$\displaystyle r(a,{h})$ $\textstyle =$ $\displaystyle -\int\!dx \int\!dy\, p(x) p(y\vert x,{h})
\left( b_1(x) \ln p(y\vert x,a) - b_2(x,y) \right)$ (83)
  $\textstyle =$ $\displaystyle -\int\!dx\, p(x)b_1(x) \int\!dy\, p(y\vert x,{h}) \ln p(y\vert x,a) + {\rm const.}$ (84)
  $\textstyle \ge$ $\displaystyle -\int\!dx\, p(x)b_1(x) \int\!dy\, p(y\vert x,{h}) \ln p(y\vert x,{h}) + {\rm const.}$ (85)
  $\textstyle =$ $\displaystyle r({h},{h})
,$ (86)

with equality for $a={h}$. For $a\in F$ replace ${h}\in {H}$ by $f\in F$. This proves Eqs. (78) and (79).
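
The statement can also be checked numerically for discrete $y$ and a single $x$, setting $b_1=1$ and $b_2=0$ in (83): the difference $r(a,{h})-r({h},{h})$ is then the (non-negative) Kullback-Leibler divergence between $p(y\vert x,{h})$ and $p(y\vert x,a)$, so no candidate $a$ beats $a={h}$. The distributions below are random illustrative choices.
\begin{verbatim}
# Numerical check of the proof above for discrete y and a single x, with
# b1 = 1 and b2 = 0: the log-loss risk r(a,h) is minimized at a = h, since
# r(a,h) - r(h,h) is the Kullback-Leibler divergence KL(h||a) >= 0.
import numpy as np

rng = np.random.default_rng(2)

def random_dist(k):
    p = rng.random(k)
    return p / p.sum()

h = random_dist(6)                          # "true" model p(y|x,h)
r_hh = -np.sum(h * np.log(h))               # risk of the optimal action a = h
risks = [-np.sum(h * np.log(random_dist(6))) for _ in range(1000)]
assert min(risks) >= r_hh - 1e-12           # no random candidate beats a = h
print(f"r(h,h) = {r_hh:.4f},  best sampled r(a,h) = {min(risks):.4f}")
\end{verbatim}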

