
Maximum A Posteriori Approximation

In this paper, our aim is to reconstruct the field $\phi $ from observational data $D$ in the Maximum A Posteriori Approximation (MAP), i.e., by maximizing

\begin{displaymath}
p(\phi\vert D) \propto p(y_D\vert x_D,\phi) \, p_0(\phi)
,
\end{displaymath} (2)

with respect to $\phi $. [Alternatively, one can use Monte Carlo methods to approximate the Bayesian predictive density $p(y\vert x,D)$.] We implement the normalization conditions on $P$ (for all $x$) by a Lagrange term with multiplier function $\Lambda_X(x)$, introduce an additional Gaussian process prior term for $\phi $ [2] (with zero mean and a real symmetric, positive definite inverse covariance ${\bf K}$), and consider the usual case where the positivity constraint is either not active at the stationary point or automatically satisfied by the choice of $\phi $, so that the corresponding Lagrange multiplier can be skipped. We therefore have to minimize the error functional
\begin{displaymath}
E_\phi =
-\mbox{$<\!\ln P(\phi)\,\vert\,N\!>$}
+\frac{1}{2} \mbox{$<\!\phi\,\vert\,{\bf K}\phi\!>$}
+\mbox{$<\!P(\phi)\,\vert\,\Lambda_X\!>$}
.
\end{displaymath} (3)

Here $N$ denotes the empirical density, i.e., $N(x,y)$ = $\sum_{i=1}^n\delta(x-x_i)\delta(y-y_i)$, and the bra-ket notation $\mbox{$<\!\cdot\,\vert\,\cdot\!>$}$ denotes scalar products, i.e., integration over $x$ and $y$. (Thus, up to a $\phi $-independent constant and the Lagrange multiplier terms implementing the constraints, the error functional (3) is the negative logarithm of the posterior (2). Prior terms can be made more flexible by including hyperparameters [1,6]; see [4] for a more detailed discussion.) Notice that a prior term which is Gaussian in the field $\phi $ is in general non-Gaussian in terms of the conditional likelihood $P$.
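
To make the discretization of Eq. (3) concrete, the following sketch evaluates the error functional on a two-dimensional grid. It is an illustration only, relying on assumptions not made in the text: the parameterization $\phi$ = $\ln P$ (so $P(\phi)$ = $e^\phi$), a finite-difference Laplacian plus a small ridge as inverse covariance ${\bf K}$, randomly generated training data, and a multiplier function $\Lambda_X$ supplied as an array.

\begin{verbatim}
# Illustrative sketch only: discretized version of the error functional (3)
# under the assumed parameterization phi = ln P, i.e. P(phi) = exp(phi).
import numpy as np

nx, ny = 30, 40                       # grid sizes for x and y
dx, dy = 1.0 / nx, 1.0 / ny           # grid spacings on the unit square
rng = np.random.default_rng(0)
x_idx = rng.integers(0, nx, size=20)  # training inputs as grid indices (toy data)
y_idx = rng.integers(0, ny, size=20)  # training outputs as grid indices (toy data)

# Empirical density N(x,y) = sum_i delta(x - x_i) delta(y - y_i), discretized
N = np.zeros((nx, ny))
np.add.at(N, (x_idx, y_idx), 1.0 / (dx * dy))

def neg_laplacian(n, h):
    """Finite-difference -d^2/dz^2 (Dirichlet-type boundaries)."""
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

Kx, Ky = neg_laplacian(nx, dx), neg_laplacian(ny, dy)

def K_apply(phi):
    """Assumed inverse covariance K phi: smoothness in x and y plus a ridge."""
    return Kx @ phi + phi @ Ky.T + 1e-3 * phi

def error_functional(phi, lambda_x):
    """E_phi = -<ln P | N> + (1/2) <phi | K phi> + <P | Lambda_X>."""
    P = np.exp(phi)
    data_term  = -np.sum(phi * N) * dx * dy               # -<ln P | N>, ln P = phi
    prior_term = 0.5 * np.sum(phi * K_apply(phi)) * dx * dy
    norm_term  = np.sum(P * lambda_x[:, None]) * dx * dy  # <P | Lambda_X>
    return data_term + prior_term + norm_term
\end{verbatim}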

The MAP equation to be solved is easily found by setting the functional derivative of (3) with respect to $\phi(x,y)$ to zero, yielding,

\begin{displaymath}
0 =
{\bf P}^\prime (\phi) {\bf P}^{-1}(\phi) N
- {{\bf K}}\phi
-{\bf P}^\prime (\phi) \Lambda_X
,
\end{displaymath} (4)

where ${\bf P}^\prime (x,y;x^\prime,y^\prime)$ = $\frac{\delta P (x^\prime,y^\prime)}{\delta \phi (x,y)}$ and ${\bf P}(x,y;x^\prime,y^\prime)$ = $\delta(x-x^\prime)\delta(y-y^\prime)P(x,y)$. If ${\bf P} {{\bf P}^\prime}^{-1}$ = $({\bf P}^\prime {\bf P}^{-1})^{-1}$ exists, this can be written as
\begin{displaymath}
0 =
N - {\bf P} {{\bf P}^\prime}^{-1}{{\bf K}}\, \phi
- {\bf P} \Lambda_X
.
\end{displaymath} (5)
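
For illustration, under the same parameterization assumed in the sketch above ($\phi$ = $\ln P$, i.e., $P(\phi)$ = $e^\phi$, which is an additional assumption and not a choice made in the text), ${\bf P}^\prime$ = ${\bf P}$ is diagonal, ${\bf P} {{\bf P}^\prime}^{-1}$ = ${\bf I}$ exists, and Eqs. (4) and (5) both reduce to
\begin{displaymath}
{\bf P}^\prime(x,y;x^\prime,y^\prime)
= \frac{\delta e^{\phi(x^\prime,y^\prime)}}{\delta \phi(x,y)}
= \delta(x-x^\prime)\delta(y-y^\prime)\,P(x,y)
= {\bf P}(x,y;x^\prime,y^\prime)
\;\Rightarrow\;
0 = N - {{\bf K}}\,\phi - {\bf P}\Lambda_X
.
\end{displaymath}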

From the normalization condition ${\bf I}_X P $ = $I$ [where ${\bf I}_X(x,y;x^\prime,y^\prime)$ = $\delta(x-x^\prime)$ and $I(x,y)$ = $1$], it follows for $\Lambda_X(x,y)$ = $\Lambda_X(x) \ne 0$ that ${\bf I}_X {\bf P} \Lambda_X$ = $\Lambda_X$, and thus, multiplying Eq. (5) by ${\bf I}_X$,
\begin{displaymath}
\Lambda_X =
{\bf I}_X \left( N - {\bf P} {{\bf P}^\prime}^{-1}{{\bf K}}\, \phi \right)
= N_X - {\bf I}_X
{\bf P} {{\bf P}^\prime}^{-1} {{\bf K}}\, \phi
,
\end{displaymath} (6)

where $N_X(x,y)$ = $N_X(x)$ = $\sum_i \delta(x-x_i)$. $\Lambda_X$ can now be eliminated by inserting Eq. (6) into Eq. (5), yielding
\begin{displaymath}
0 =
\left( {\bf I} - {\bf P}{\bf I}_X\right)
\left( N - {\bf P} {{\bf P}^\prime}^{-1}{{\bf K}}\, \phi \right).
\end{displaymath} (7)
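
Continuing the grid sketch above (still under the illustrative assumption $\phi$ = $\ln P$, for which ${\bf P} {{\bf P}^\prime}^{-1}$ = ${\bf I}$), the multiplier of Eq. (6) and the residual of the stationarity condition (7) take the following form; a MAP solution is characterized by a vanishing residual on every grid point.

\begin{verbatim}
# Sketch, continuing the grid example above (phi = ln P, so P P'^{-1} = I).
def lambda_X(phi):
    """Eq. (6): Lambda_X = N_X - I_X P P'^{-1} K phi, here N_X - I_X K phi."""
    N_X = N.sum(axis=1) * dy                    # N_X(x) = sum_i delta(x - x_i)
    return N_X - K_apply(phi).sum(axis=1) * dy  # I_X integrates over y

def residual_eq7(phi):
    """Right-hand side of Eq. (7): (I - P I_X)(N - P P'^{-1} K phi)."""
    P = np.exp(phi)
    G = N - K_apply(phi)                        # N - P P'^{-1} K phi (= N - K phi here)
    IX_G = G.sum(axis=1, keepdims=True) * dy    # (I_X G)(x) = integral of G over y
    return G - P * IX_G                         # vanishes at a MAP solution
\end{verbatim}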

In contrast to regression with Gaussian process priors [12] (where MAP is exact) and, similarly, to classification for a specific choice of $P(\phi)$ [11], the solution of Eq. (7) cannot be found in a finite-dimensional space defined by the training data $D$. Thus, Eq. (7) has to be solved by discretization, analogously to, for example, solving field theories on a lattice [7]. Starting from an initial guess $\phi^{(0)}$, a possible iteration scheme is given by

\begin{displaymath}
\phi^{(r+1)} =
\phi^{(r)} + \eta^{(r)} \left({\bf A}^{(r)}\right)^{-1}
\left[
\left( {\bf I} - {\bf P}^{(r)}{\bf I}_X\right)
\left( N - {\bf P}^{(r)} \left({{\bf P}^\prime}^{(r)}\right)^{-1}{{\bf K}}\,
\phi^{(r)} \right)
\right]
,
\end{displaymath} (8)

with some positive definite matrix ${\bf A}^{(r)}$ and a number $\eta^{(r)}>0$, both possibly changing during the iteration. A convenient $r$-independent choice for ${\bf A}^{(r)}$ is often the inverse covariance ${\bf K}$. (Quasi-Newton methods, for example, use an $r$-dependent ${\bf A}^{(r)}$; the standard gradient algorithm corresponds to choosing the identity matrix for ${\bf A}^{(r)}$.) For each calculated gradient, the factor $\eta^{(r)}$ can be determined by a one-dimensional line search algorithm. An iteration scheme similar to Eq. (8) can be obtained using Eq. (4), which then requires adapting the Lagrange multiplier function $\Lambda_X(x)$ during the iteration.
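
The following continues the grid sketch above and implements the iteration (8) with the $r$-independent choice ${\bf A}^{(r)}$ = ${\bf K}$ mentioned in the text. It is a minimal illustration only: the step size $\eta$ is kept fixed instead of being determined by a line search, and convergence would in practice be monitored, e.g., via the error functional (3), with $\eta$ reduced if necessary.

\begin{verbatim}
# Sketch of the iteration (8) with A^{(r)} = K, on the grid defined above.
# Fixed step size for brevity; in practice eta would come from a line search.
K_full = (np.kron(Kx, np.eye(ny)) + np.kron(np.eye(nx), Ky)
          + 1e-3 * np.eye(nx * ny))              # dense matrix form of K

def map_iteration(phi0, eta=0.5, n_steps=200):
    phi = phi0.copy()
    for _ in range(n_steps):
        grad = residual_eq7(phi)                 # bracket in Eq. (8), cf. Eq. (7)
        step = np.linalg.solve(K_full, grad.ravel()).reshape(nx, ny)
        phi = phi + eta * step                   # phi^(r+1) = phi^(r) + eta A^{-1}[...]
    return phi

phi_map = map_iteration(np.zeros((nx, ny)))
P_map = np.exp(phi_map)                          # approximate MAP conditional density
\end{verbatim}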

Clearly, a direct discretization can only be useful for low-dimensional $x$ and $y$ variables (say, one- or two-dimensional, as for time series or images). Due to increasing computing capabilities, however, many low-dimensional problems are now directly solvable by discretization [3]. For variables $x$ or $y$ living in higher-dimensional spaces, additional approximations are necessary [4].

