

Maximum A Posteriori Approximation

In most applications the (usually very high or even formally infinite dimensional) ${h}$-integral over model states in Eq. (23) cannot be performed exactly. The two most common methods used to calculate the ${h}$ integral approximately are Monte Carlo integration [155,91,95,198,16,70,199,21,219,238,69,170,180,202,171] and saddle point approximation [16,45,30,172,17,249,201,69,76,132]. The latter approach will be studied in the following.
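
As a minimal one-dimensional illustration of the saddle point idea, the following sketch compares the Laplace approximation of an integral of the form $\int\!dh\, e^{-E(h)}$ with direct numerical quadrature; the toy energy $E(h)$ and all numerical settings are arbitrary illustrative choices, not quantities appearing in the text.
\begin{verbatim}
# Minimal 1D sketch (toy energy, not the functional integral of the text):
# compare Z = int dh exp(-E(h)) computed by quadrature with its saddle point
# (Laplace) approximation exp(-E(h*)) * sqrt(2*pi / E''(h*)).
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def E(h):
    # illustrative single-minimum, non-quadratic energy
    return 0.5 * (h - 1.0) ** 2 + 0.1 * h ** 4

Z_exact, _ = quad(lambda h: np.exp(-E(h)), -np.inf, np.inf)

h_star = minimize_scalar(E).x                     # location of the minimum
eps = 1e-5
E2 = (E(h_star + eps) - 2 * E(h_star) + E(h_star - eps)) / eps**2   # E''(h*)
Z_laplace = np.exp(-E(h_star)) * np.sqrt(2 * np.pi / E2)

print(f"quadrature: {Z_exact:.6f}   Laplace: {Z_laplace:.6f}")
\end{verbatim}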

For that purpose, we expand $E_{\rm comb}$ of Eq. (24) with respect to ${h}$ around some ${h}^{*}$

$\displaystyle E_{\rm comb} ({h})$ $\textstyle =$ $\displaystyle e^{(\Delta {h},\nabla) }E_{\rm comb}({h})\Big\vert _{{h}={h}^{*}}$ (66)
  $\textstyle =$ $\displaystyle E_{\rm comb} ({h}^{*}) + ( \Delta {h} , \nabla ({h}^{*}) ) + \frac{1}{2}( \Delta {h}, {\bf H}({h}^{*}) \Delta {h}) + \cdots$  

with $\Delta {h}$ = $( {h}-{h}^{*})$, gradient $\nabla$ (not acting on $\Delta {h}$), Hessian ${\bf H}$, and round brackets $( \cdots , \cdots)$ denoting scalar products. In case $p(y\vert x,{h})$ is parameterized independently for every $x$, $y$, the states ${h}$ represent a parameter set indexed by $x$ and $y$; hence
\begin{displaymath}
\nabla (x,y;{h}^{*})
= \frac{\delta E_{\rm comb}({h})}
{\delta {h}(x,y)}\Bigg\vert _{{h}={h}^{*}}
= \frac{\delta E_{\rm comb}({h})}
{\delta p(y\vert x,{h})}\Bigg\vert _{{h}={h}^{*}}
,
\end{displaymath} (67)


\begin{displaymath}
{\bf H}(x,y,x^\prime,y^\prime;{h}^{*})
= \frac{\delta^2 E_{\rm comb}({h})}
{\delta p(y\vert x,{h})\, \delta p(y^\prime\vert x^\prime,{h})}
\Bigg\vert _{{h}={h}^{*}}
,
\end{displaymath} (68)

are functional derivatives [97,106,29,36] (or partial derivatives for discrete $x$, $y$) and for example
\begin{displaymath}
(\Delta {h}, \nabla ({h}^{*}) )
= \int\!dx\,dy\,
\left( {h}(x,y)-{h}^{*}(x,y) \right) \nabla ({h}^{*})(x,y)
.
\end{displaymath} (69)
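
To make the expansion (66)-(69) concrete, the following sketch checks the second order Taylor approximation for a discretized ${h}$ (a finite-dimensional vector), with gradient and Hessian obtained by finite differences; the toy energy and all numerical settings are arbitrary illustrative choices.
\begin{verbatim}
# Sketch of the expansion (66)-(69) for a discretized h (a finite vector):
# compare E_comb(h* + dh) with its second order Taylor approximation built
# from a numerical gradient and Hessian. E_comb below is a toy energy.
import numpy as np

def E_comb(h):
    return 0.5 * np.sum(h**2) + 0.1 * np.sum(h**4) + np.sum(np.diff(h)**2)

d = 5
h_star = np.zeros(d)                       # minimum of the toy energy
dh = 0.1 * np.ones(d)                      # small displacement Delta h
eps = 1e-5

grad = np.array([(E_comb(h_star + eps*e) - E_comb(h_star - eps*e)) / (2*eps)
                 for e in np.eye(d)])
hess = np.array([[(E_comb(h_star + eps*ei + eps*ej)
                   - E_comb(h_star + eps*ei - eps*ej)
                   - E_comb(h_star - eps*ei + eps*ej)
                   + E_comb(h_star - eps*ei - eps*ej)) / (4*eps**2)
                  for ej in np.eye(d)] for ei in np.eye(d)])

taylor = E_comb(h_star) + grad @ dh + 0.5 * dh @ hess @ dh
print(E_comb(h_star + dh), taylor)         # close for small dh
\end{verbatim}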

Choosing ${h}^{*}$ to be the location of a local minimum of $E_{\rm comb}({h})$, the linear term in (66) vanishes. The second order term includes the Hessian and corresponds to a Gaussian integral over ${h}$ which can be solved analytically,

\begin{displaymath}
\int\!d{h}\, e^{-\frac{1}{2}(\Delta {h}, {\bf H}\Delta {h})}
=
(2\pi)^\frac{d}{2}
(\det {\bf H})^{-\frac{1}{2}}
,\end{displaymath} (70)

for a $d$-dimensional ${h}$-integral. However, using the same approximation for the ${h}$-integrals in numerator and denominator of Eq. (23), then also expanding $p(y\vert x,{h})$ around ${h}^{*}$ and keeping only the first (${h}$-independent) term $p(y\vert x,{h}^{*})$ of that expansion, the factor (70) cancels, even for infinite $d$. (The result is the zero order term of an expansion of the predictive density in powers of $1/\beta$. Higher order contributions can be calculated by using Wick's theorem [45,30,172,249,109,163,132].) The final approximate result for the predictive density (27) is very simple and intuitive,
\begin{displaymath}
p(y\vert x,f) \approx p(y\vert x,{h}^{*})
,
\end{displaymath} (71)

with
\begin{displaymath}
{h}^{*}
= {\rm argmin}_{{h}\in {H}} E_{\rm comb}({h})
= {\rm argmax}_{{h}\in {H}} \,p({h}\vert f)
= {\rm argmax}_{{h}\in {H}} \,p(y_D\vert x_D,{h})\,p({h}\vert D_0)
.
\end{displaymath} (72)

The saddle point (or Laplace) approximation is therefore also called Maximum A Posteriori Approximation (MAP). Notice that the same $h^*$ also maximizes the integrand of the evidence of the data $y_D$
\begin{displaymath}
p(y_D\vert x_D,D_0)
=
\int dh\, p(y_D\vert x_D,h) p(h\vert D_0)
.
\end{displaymath} (73)

This is due to the assumption that $p(y\vert x,{h})$ is slowly varying at the stationary point and therefore does not have to be included in the saddle point approximation for the predictive density. For (functionally) differentiable $E_{\rm comb}$, Eq. (72) yields the stationarity equation,
\begin{displaymath}
\frac{\delta E_{\rm comb}({h})}{\delta {h}(x,y)} = 0
.
\end{displaymath} (74)

The functional $E_{\rm comb}$, including training and prior data (regularization, stabilizer) terms, is also known as the (regularized) error functional for ${h}$.
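
A discretized sketch in the spirit of Eqs. (72) and (74) might look as follows: a regularized error functional consisting of a Gaussian data term and a smoothness prior energy (discretized squared second derivative) is minimized numerically to obtain ${h}^{*}$; the grid, $\sigma$, and the regularization weight are hypothetical illustrative choices.
\begin{verbatim}
# Hypothetical discretized sketch: minimize a regularized error functional
# (Gaussian data term + smoothness prior energy) to obtain the MAP solution.
# grid, sigma and lambda_ are illustrative choices, not fixed by the text.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 50)                 # discretization of x
dx = x_grid[1] - x_grid[0]
x_data = rng.uniform(0.0, 1.0, 20)
y_data = np.sin(2 * np.pi * x_data) + 0.3 * rng.normal(size=20)
idx = np.searchsorted(x_grid, x_data).clip(0, len(x_grid) - 1)
sigma, lambda_ = 0.3, 1e-3

def E_comb(h):
    # training (data) term: negative Gaussian log-likelihood up to constants
    data_term = 0.5 / sigma**2 * np.sum((y_data - h[idx])**2)
    # prior term: discretized squared second derivative as smoothness measure
    prior_term = 0.5 * lambda_ * np.sum(np.diff(h, 2)**2) / dx**3
    return data_term + prior_term

h_star = minimize(E_comb, np.zeros_like(x_grid), method="L-BFGS-B").x
# h_star plays the role of the MAP solution; per Eq. (71) the predictive
# density is then approximated by p(y|x,h*).
\end{verbatim}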

In practice a saddle point approximation may be expected to be useful if the posterior is peaked enough around a single maximum or, more generally, if the posterior is well approximated by a Gaussian centered at the maximum. For asymptotic results one would have to require

\begin{displaymath}
-\frac{1}{\beta} \sum_i L(y_i\vert x_i,{h})
,
\end{displaymath} (75)

to become $\beta$-independent for $\beta\rightarrow \infty$, with the same $\beta$ appearing in the prior and the data term (see [40,242]). If, for example, $\frac{1}{n} \sum_i L(y_i\vert x_i,{h})$ converges for a large number $n$ of training data, the low temperature limit $1/\beta\rightarrow 0$ can be interpreted as the large data limit $n\rightarrow\infty$,
\begin{displaymath}
n E_{\rm comb} =
n \left( -\frac{1}{n} \sum_i L(y_i\vert x_i,{h})
+ \frac{1}{n} E({h}\vert D_0) \right)
.
\end{displaymath} (76)

Notice, however, the factor $1/n$ in front of the prior energy. For Gaussian $p(y\vert x,{h})$ the temperature $1/\beta$ corresponds to the variance $\sigma^2$,
\begin{displaymath}
\frac{1}{\sigma^2}
E_{\rm comb} =
\frac{1}{\sigma^2}
\left( \frac{1}{2} \sum_i (y_i-{h}(x_i))^2
+ \sigma^2
E({h}\vert D_0) \right)
.
\end{displaymath} (77)

For a Gaussian prior this would require a simultaneous scaling of the data and prior variances.
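
For instance, assuming the standard Gaussian form $p(y_i\vert x_i,{h}) = (2\pi\sigma^2)^{-1/2}\exp\left(-(y_i-{h}(x_i))^2/(2\sigma^2)\right)$ with constant variance $\sigma^2$, the data term of the energy becomes
\begin{displaymath}
-\sum_i \ln p(y_i\vert x_i,{h})
= \frac{1}{2\sigma^2} \sum_i \left(y_i-{h}(x_i)\right)^2 + {\rm const.}
,
\end{displaymath}
so that the inverse variance $1/\sigma^2$ multiplies the data term in the same way as $\beta$ (respectively $n$) does above; this is the identification underlying Eq. (77).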

We should also remark that for continuous $x$, $y$ the stationary solution ${h}^{*}$ need not be a typical representative of the process $p({h}\vert f)$. A common example is a Gaussian stochastic process $p({h}\vert f)$ with prior energy $E({h}\vert D_0)$ related to some smoothness measure of ${h}$ expressed by derivatives of $p(y\vert x,{h})$. Then, even if the stationary ${h}^{*}$ is smooth, this need not be the case for a typical ${h}$ sampled according to $p({h}\vert f)$. For Brownian motion, for instance, a typical sample path is not even differentiable (though continuous), while the stationary path is smooth. Thus, for continuous variables, only expressions like $\int \!d{h}\,e^{-\beta E({h})}$ can be given an exact meaning as a Gaussian measure, defined by a given covariance with existing normalization factor, but not the expressions $d{h}$ and $E({h})$ alone [51,62,228,110,83,144].
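
The Brownian motion remark can be illustrated numerically: the difference quotients of a typical sample path grow like $1/\sqrt{dt}$ as the time step shrinks, so no derivative exists in the limit, whereas the mean path (here identically zero) is perfectly smooth. The following sketch uses arbitrary numerical settings.
\begin{verbatim}
# Sketch of the remark above: for Brownian motion the difference quotients
# of a typical sample path grow like 1/sqrt(dt) (no derivative in the limit),
# while the mean ("stationary") path is identically zero and smooth.
import numpy as np

rng = np.random.default_rng(0)
for dt in (1e-2, 1e-4, 1e-6):
    n = int(1.0 / dt)
    increments = rng.normal(scale=np.sqrt(dt), size=n)   # Wiener increments
    path = np.cumsum(increments)                         # typical sample path
    diff_quot = np.abs(np.diff(path)) / dt               # |h(t+dt)-h(t)| / dt
    print(f"dt={dt:.0e}  mean |difference quotient| ~ {diff_quot.mean():.1f}")
# the printed values grow roughly like 1/sqrt(dt): the typical path is
# continuous but not differentiable, unlike the smooth stationary path h*.
\end{verbatim}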

Interestingly, the stationary ${h}^{*}$ yielding maximal posterior $p({h}\vert f)$ is not only useful for obtaining an approximation to the predictive density $p(y\vert x,f)$ but is also the optimal solution $a^*$ of a Bayesian decision problem with log-loss and $a\in A={H}$.

Indeed, for a Bayesian decision problem with log-loss (46)

\begin{displaymath}
{\rm argmin}_{a\in {H}} r(a,{h}) = {h}
,
\end{displaymath} (78)

and analogously,
\begin{displaymath}
{\rm argmin}_{a\in F} r(a,f) = f
.
\end{displaymath} (79)


This is proved as follows: Jensen's inequality states that

\begin{displaymath}
\int\!dy\, p(y)g(q(y))
\ge
g( \int\!dy\, p(y) q(y) )
,
\end{displaymath} (80)

for any convex function $g$ and probability density $p(y)\ge 0$ with $\int\!dy\, p(y) =1$. Thus, because the logarithm is concave, so that $-\ln$ is convex,
\begin{displaymath}
-\int\!dy\, p(y\vert x,{h}) \ln\frac{p(y\vert x,a)}{p(y\vert x,{h})}
\ge
-\ln \int\!dy\, p(y\vert x,{h}) \frac{p(y\vert x,a)}{p(y\vert x,{h})}
=0
\end{displaymath} (81)


\begin{displaymath}
\Rightarrow
-\int\!dy\, p(y\vert x,{h}) \ln p(y\vert x,a)
\ge
-\int\!dy\, p(y\vert x,{h}) \ln p(y\vert x,{h})
,
\end{displaymath} (82)

with equality for $a={h}$. Hence
$\displaystyle r(a,{h})$ $\textstyle =$ $\displaystyle -\int\!dx \int\!dy\, p(x) p(y\vert x,{h})
\left( b_1(x) \ln p(y\vert x,a) - b_2(x,y) \right)$ (83)
  $\textstyle =$ $\displaystyle -\int\!dx\, p(x)b_1(x) \int\!dy\, p(y\vert x,{h}) \ln p(y\vert x,a) + {\rm const.}$ (84)
  $\textstyle \ge$ $\displaystyle -\int\!dx\, p(x)b_1(x) \int\!dy\, p(y\vert x,{h}) \ln p(y\vert x,{h}) + {\rm const.}$ (85)
  $\textstyle =$ $\displaystyle r({h},{h})
,$ (86)

with equality for $a={h}$. For $a\in F$ replace ${h}\in {H}$ by $f\in F$. This proves Eqs. (78) and (79).
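
The statement can also be checked numerically for discrete $y$ and a single $x$, setting $b_1=1$ and $b_2=0$ in (83): the difference $r(a,{h})-r({h},{h})$ is then the (non-negative) Kullback-Leibler divergence between $p(y\vert x,{h})$ and $p(y\vert x,a)$, so no candidate $a$ beats $a={h}$. The distributions below are random illustrative choices.
\begin{verbatim}
# Numerical check of the proof above for discrete y and a single x, with
# b1 = 1 and b2 = 0: the log-loss risk r(a,h) is minimized at a = h, since
# r(a,h) - r(h,h) is the Kullback-Leibler divergence KL(h||a) >= 0.
import numpy as np

rng = np.random.default_rng(2)

def random_dist(k):
    p = rng.random(k)
    return p / p.sum()

h = random_dist(6)                          # "true" model p(y|x,h)
r_hh = -np.sum(h * np.log(h))               # risk of the optimal action a = h
risks = [-np.sum(h * np.log(random_dist(6))) for _ in range(1000)]
assert min(risks) >= r_hh - 1e-12           # no random candidate beats a = h
print(f"r(h,h) = {r_hh:.4f},  best sampled r(a,h) = {min(risks):.4f}")
\end{verbatim}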

