
Maximum A Posteriori Approximation

In this paper, our aim is to reconstruct the field $\phi $ from observational data $D$ in the Maximum A Posteriori Approximation (MAP), i.e., by maximizing

\begin{displaymath}
p(\phi\vert D) \propto p(y_D\vert x_D,\phi) \, p_0(\phi)
,
\end{displaymath} (2)

with respect to $\phi $. [Alternatively, one can use Monte Carlo methods to approximate the Bayesian predictive density $p(y\vert x,D)$.] We implement the normalization conditions on $P$ (for all $x$) by a Lagrange term with multiplier function $\Lambda_X(x)$, introduce an additional Gaussian process prior term for $\phi $ [2] (with zero mean and a real symmetric, positive definite inverse covariance ${\bf K}$), and consider the usual case where the positivity constraint is either not active at the stationary point or automatically satisfied by the choice of $\phi $, so that the corresponding Lagrange multiplier can be skipped. We therefore have to minimize the error functional
\begin{displaymath}
E_\phi =
-\mbox{$<\!\ln P(\phi)\,\vert\,N\!>$}
+\frac{1}{2} \mbox{$<\!\phi\,\vert\,{\bf K}\phi\!>$}
+\mbox{$<\!P(\phi)\,\vert\,\Lambda_X\!>$}
.
\end{displaymath} (3)

Here $N$ denotes the empirical density, i.e., $N(x,y)$ = $\sum_{i=1}^n\delta(x-x_i)\delta(y-y_i)$, and the bra-ket notation $\mbox{$<\!\cdot\,\vert\,\cdot\!>$}$ denotes scalar products, i.e., integration over $x$ and $y$. (Thus, up to a $\phi $-independent constant and the Lagrange multiplier terms implementing the constraints, the error functional (3) is the negative logarithm of the posterior (2). Prior terms can be made more flexible by including hyperparameters [1,6]; see [4] for a more detailed discussion.) Notice that a prior term which is Gaussian in the field $\phi $ is in general non-Gaussian in terms of the conditional likelihood $P$.
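
To make the discretization of Eq. (3) concrete, the following sketch evaluates the error functional on a two-dimensional grid. It is an illustration only, relying on assumptions not made in the text: the parameterization $\phi$ = $\ln P$ (so $P(\phi)$ = $e^\phi$), a finite-difference Laplacian plus a small ridge as inverse covariance ${\bf K}$, randomly generated training data, and a multiplier function $\Lambda_X$ supplied as an array.

\begin{verbatim}
# Illustrative sketch only: discretized version of the error functional (3)
# under the assumed parameterization phi = ln P, i.e. P(phi) = exp(phi).
import numpy as np

nx, ny = 30, 40                       # grid sizes for x and y
dx, dy = 1.0 / nx, 1.0 / ny           # grid spacings on the unit square
rng = np.random.default_rng(0)
x_idx = rng.integers(0, nx, size=20)  # training inputs as grid indices (toy data)
y_idx = rng.integers(0, ny, size=20)  # training outputs as grid indices (toy data)

# Empirical density N(x,y) = sum_i delta(x - x_i) delta(y - y_i), discretized
N = np.zeros((nx, ny))
np.add.at(N, (x_idx, y_idx), 1.0 / (dx * dy))

def neg_laplacian(n, h):
    """Finite-difference -d^2/dz^2 (Dirichlet-type boundaries)."""
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

Kx, Ky = neg_laplacian(nx, dx), neg_laplacian(ny, dy)

def K_apply(phi):
    """Assumed inverse covariance K phi: smoothness in x and y plus a ridge."""
    return Kx @ phi + phi @ Ky.T + 1e-3 * phi

def error_functional(phi, lambda_x):
    """E_phi = -<ln P | N> + (1/2) <phi | K phi> + <P | Lambda_X>."""
    P = np.exp(phi)
    data_term  = -np.sum(phi * N) * dx * dy               # -<ln P | N>, ln P = phi
    prior_term = 0.5 * np.sum(phi * K_apply(phi)) * dx * dy
    norm_term  = np.sum(P * lambda_x[:, None]) * dx * dy  # <P | Lambda_X>
    return data_term + prior_term + norm_term
\end{verbatim}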

The MAP equation to be solved is easily found by setting the functional derivative of (3) with respect to $\phi(x,y)$ to zero, yielding,

\begin{displaymath}
0 =
{\bf P}^\prime (\phi) {\bf P}^{-1}(\phi) N
- {{\bf K}}\phi
-{\bf P}^\prime (\phi) \Lambda_X
,
\end{displaymath} (4)

where ${\bf P}^\prime (x,y;x^\prime,y^\prime)$ = $\frac{\delta P (x^\prime,y^\prime)}{\delta \phi (x,y)}$ and ${\bf P}(x,y;x^\prime,y^\prime)$ = $\delta(x-x^\prime)\delta(y-y^\prime)P(x,y)$. If ${\bf P} {{\bf P}^\prime}^{-1}$ = $({\bf P}^\prime {\bf P}^{-1})^{-1}$ exists, this can be written as
\begin{displaymath}
0 =
N - {\bf P} {{\bf P}^\prime}^{-1}{{\bf K}}\, \phi
- {\bf P} \Lambda_X
.
\end{displaymath} (5)
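
For illustration, under the same parameterization assumed in the sketch above ($\phi$ = $\ln P$, i.e., $P(\phi)$ = $e^\phi$, which is an additional assumption and not a choice made in the text), ${\bf P}^\prime$ = ${\bf P}$ is diagonal, ${\bf P} {{\bf P}^\prime}^{-1}$ = ${\bf I}$ exists, and Eqs. (4) and (5) both reduce to
\begin{displaymath}
{\bf P}^\prime(x,y;x^\prime,y^\prime)
= \frac{\delta e^{\phi(x^\prime,y^\prime)}}{\delta \phi(x,y)}
= \delta(x-x^\prime)\delta(y-y^\prime)\,P(x,y)
= {\bf P}(x,y;x^\prime,y^\prime)
\;\Rightarrow\;
0 = N - {{\bf K}}\,\phi - {\bf P}\Lambda_X
.
\end{displaymath}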

From the normalization condition ${\bf I}_X P $ = $I$ [where ${\bf I}_X(x,y;x^\prime,y^\prime)$ = $\delta(x-x^\prime)$ and $I(x,y)$ = $1$], it follows for $\Lambda_X(x,y)$ = $\Lambda_X(x) \ne 0$ that ${\bf I}_X {\bf P} \Lambda_X$ = $\Lambda_X$, and thus, multiplying Eq. (5) by ${\bf I}_X$,
\begin{displaymath}
\Lambda_X =
{\bf I}_X \left( N - {\bf P} {{\bf P}^\prime}^{-1}{{\bf K}}\, \phi \right)
= N_X - {\bf I}_X
{\bf P} {{\bf P}^\prime}^{-1} {{\bf K}}\, \phi
,
\end{displaymath} (6)

where $N_X(x,y)$ = $N_X(x)$ = $\sum_i \delta(x-x_i)$. $\Lambda_X$ can now be eliminated by inserting Eq. (6) into Eq. (5), yielding
\begin{displaymath}
0 =
\left( {\bf I} - {\bf P}{\bf I}_X\right)
\left( N - {\bf P} {{\bf P}^\prime}^{-1}{{\bf K}}\, \phi \right).
\end{displaymath} (7)
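
Continuing the grid sketch above (still under the illustrative assumption $\phi$ = $\ln P$, for which ${\bf P} {{\bf P}^\prime}^{-1}$ = ${\bf I}$), the multiplier of Eq. (6) and the residual of the stationarity condition (7) take the following form; a MAP solution is characterized by a vanishing residual on every grid point.

\begin{verbatim}
# Sketch, continuing the grid example above (phi = ln P, so P P'^{-1} = I).
def lambda_X(phi):
    """Eq. (6): Lambda_X = N_X - I_X P P'^{-1} K phi, here N_X - I_X K phi."""
    N_X = N.sum(axis=1) * dy                    # N_X(x) = sum_i delta(x - x_i)
    return N_X - K_apply(phi).sum(axis=1) * dy  # I_X integrates over y

def residual_eq7(phi):
    """Right-hand side of Eq. (7): (I - P I_X)(N - P P'^{-1} K phi)."""
    P = np.exp(phi)
    G = N - K_apply(phi)                        # N - P P'^{-1} K phi (= N - K phi here)
    IX_G = G.sum(axis=1, keepdims=True) * dy    # (I_X G)(x) = integral of G over y
    return G - P * IX_G                         # vanishes at a MAP solution
\end{verbatim}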

In contrast to regression with Gaussian process priors [12] (where MAP is exact) and, similarly, to classification for a specific choice of $P(\phi)$ [11], the solution of Eq. (7) cannot be found in a finite-dimensional space defined by the training data $D$. Thus, Eq. (7) has to be solved by discretization, analogously to, for example, solving field theories on a lattice [7]. Starting from an initial guess $\phi^{(0)}$, a possible iteration scheme is given by

\begin{displaymath}
\phi^{(r+1)} =
\phi^{(r)} + \eta^{(r)} \left({\bf A}^{(r)}\right)^{-1}
\left[
\left( {\bf I} - {\bf P}^{(r)}{\bf I}_X\right)
\left( N - {\bf P}^{(r)} \left({{\bf P}^\prime}^{(r)}\right)^{-1}{{\bf K}}\,
\phi^{(r)} \right)
\right]
,
\end{displaymath} (8)

with some positive definite matrix ${\bf A}^{(r)}$ and a number $\eta^{(r)}>0$, both possibly changing during the iteration. A convenient $r$-independent choice for ${\bf A}^{(r)}$ is often the inverse covariance ${\bf K}$. (Quasi-Newton methods, for example, use an $r$-dependent ${\bf A}^{(r)}$; the standard gradient algorithm corresponds to choosing the identity matrix for ${\bf A}^{(r)}$.) For each calculated gradient, the factor $\eta^{(r)}$ can be determined by a one-dimensional line search algorithm. An iteration scheme similar to Eq. (8) can be obtained using Eq. (4), which then requires adapting the Lagrange multiplier function $\Lambda_X(x)$ during the iteration.
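
The following continues the grid sketch above and implements the iteration (8) with the $r$-independent choice ${\bf A}^{(r)}$ = ${\bf K}$ mentioned in the text. It is a minimal illustration only: the step size $\eta$ is kept fixed instead of being determined by a line search, and convergence would in practice be monitored, e.g., via the error functional (3), with $\eta$ reduced if necessary.

\begin{verbatim}
# Sketch of the iteration (8) with A^{(r)} = K, on the grid defined above.
# Fixed step size for brevity; in practice eta would come from a line search.
K_full = (np.kron(Kx, np.eye(ny)) + np.kron(np.eye(nx), Ky)
          + 1e-3 * np.eye(nx * ny))              # dense matrix form of K

def map_iteration(phi0, eta=0.5, n_steps=200):
    phi = phi0.copy()
    for _ in range(n_steps):
        grad = residual_eq7(phi)                 # bracket in Eq. (8), cf. Eq. (7)
        step = np.linalg.solve(K_full, grad.ravel()).reshape(nx, ny)
        phi = phi + eta * step                   # phi^(r+1) = phi^(r) + eta A^{-1}[...]
    return phi

phi_map = map_iteration(np.zeros((nx, ny)))
P_map = np.exp(phi_map)                          # approximate MAP conditional density
\end{verbatim}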

Clearly, a direct discretization can only be useful for low-dimensional $x$ and $y$ variables (say, one- or two-dimensional, as for time series or images). Due to increasing computing capabilities, however, many low-dimensional problems are now directly solvable by discretization [3]. For variables $x$ or $y$ living in higher-dimensional spaces, additional approximations are necessary [4].

