
Normalization, non-negativity, and specific priors

Density estimation problems are characterized by the normalization and non-negativity conditions for $p(y\vert x,{h})$. Thus, the prior density $p({h}\vert D_0)$ can only be non-zero for those ${h}$ for which $p(y\vert x,{h})$ is positive and normalized over $y$ for all $x$. (Similarly, when solving for a distribution function, i.e., the integral of a density, the non-negativity constraint is replaced by monotonicity and the normalization constraint by the requirement that the distribution function be 1 on the right boundary.) While the non-negativity constraint is local with respect to $x$ and $y$, the normalization constraint is nonlocal with respect to $y$. Thus, implementing a normalization constraint leads to nonlocal and in general non-Gaussian priors.

For classification problems, which have discrete $y$ values (classes), the normalization constraint simply requires a sum over the different classes, and a Gaussian prior structure with respect to the $x$-dependency is not altered [236]. For general density estimation problems, however, i.e., for continuous $y$, the loss of the Gaussian structure with respect to $y$ is more severe, because non-Gaussian functional integrals can in general not be performed analytically. On the other hand, when the learning problem is solved numerically by discretizing the $y$ and $x$ variables, the normalization term is typically not a severe complication.
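
To illustrate the last point, the following minimal Python sketch (all grid sizes, variable names, and the random log-density are hypothetical illustrations, not taken from the text) shows that after discretization the normalization of $p(y\vert x,{h})$ reduces to a simple sum over $y$-bins for each $x$:

\begin{verbatim}
import numpy as np

# Hypothetical grid: n_x bins in x, n_y bins in y of width dy.
n_x, n_y, dy = 20, 50, 0.1
L = np.random.randn(n_x, n_y)        # stands for a log-density L(y|x,h) on the grid
p = np.exp(L)                        # p(y|x,h) = e^L is positive by construction
Z_X = p.sum(axis=1) * dy             # discretized Z_X(x,h) = int dy p(y|x,h), one value per x
p = p / Z_X[:, None]                 # rescale each x-row so that it integrates to one
assert np.allclose(p.sum(axis=1) * dy, 1.0)
\end{verbatim}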

To be specific, consider a Maximum A Posteriori Approximation, minimizing

\begin{displaymath}
\beta E_{\rm comb}
=-\sum_i L(y_i\vert x_i,{h}) + \beta E({h}\vert D_0)
,
\end{displaymath} (87)

where the likelihood free energy $F(Y_D\vert x_D,{h})$ is included, but not the prior free energy $F({H}\vert D_0)$ which, being ${h}$-independent, is irrelevant for minimization with respect to $h$. The prior energy $\beta E({h}\vert D_0)$ has to implement the non-negativity and normalization conditions
$\displaystyle Z_X(x,{h})=
\int \!dy\, p(y\vert x,{h}) = 1,$   $\displaystyle \forall x \in X, \forall {h}\in {H},$ (88)
$\displaystyle p(y\vert x,{h}) \ge 0,$   $\displaystyle \forall y \in Y,
\forall x \in X,
\forall {h}\in {H}
.$ (89)
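
As an illustration only (the quadratic smoothness energy and all names below are assumptions made for this sketch, not the prior discussed in the text), the MAP energy of Eq. (87) and the conditions (88) and (89) take the following form on such a discretized grid:

\begin{verbatim}
import numpy as np

def combined_energy(L, data_idx, beta, dy):
    # L: array of shape (n_x, n_y) holding the log-density L(y|x,h);
    # data_idx: list of (x_bin, y_bin) pairs representing the training data (x_i, y_i).
    likelihood_term = -sum(L[ix, iy] for ix, iy in data_idx)   # -sum_i L(y_i|x_i,h)
    prior_energy = 0.5 * np.sum(np.diff(L, axis=1)**2)         # illustrative smoothness E(h|D_0)
    return likelihood_term + beta * prior_energy               # beta * E_comb of Eq. (87)

def satisfies_constraints(L, dy, tol=1e-6):
    p = np.exp(L)                                              # e^L >= 0, so Eq. (89) holds automatically
    return np.allclose(p.sum(axis=1) * dy, 1.0, atol=tol)      # Eq. (88), checked per x-bin
\end{verbatim}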

It is useful to isolate the normalization condition and non-negativity constraint defining the class of density estimation problems from the rest of the problem specific priors. Introducing the specific prior information $\tilde D_0$ so that $D_0$ = $\{ \tilde D_0, {\rm normalized,positive} \}$, we have
\begin{displaymath}
p({h}\vert\tilde D_0,{\rm norm.,pos.})
=
\frac{p({\rm norm.,pos.}\vert{h},\tilde D_0)\,p({h}\vert\tilde D_0)}{p({\rm norm.,pos.}\vert\tilde D_0)}
,
\end{displaymath} (90)

with deterministic, $\tilde D_0$-independent
\begin{displaymath}
p({\rm norm.,pos.}\vert{h})
= p({\rm norm.,pos.}\vert{h}, \tilde D_0)
\end{displaymath} (91)


\begin{displaymath}
=p({\rm norm.}\vert{h}) p({\rm pos.}\vert{h})
= \delta (Z_X - 1) \prod_{xy}\Theta \Big(p(y\vert x,h)\Big)
,
\end{displaymath} (92)

and step function $\Theta$. (The density $p({\rm norm.}\vert{h})$ is normalized over all possible normalizations of $p(y\vert x,h)$, i.e., over all possible values of $Z_X$, and $p({\rm pos.}\vert{h})$ over all possible sign combinations.) The ${h}$-independent denominator $p({\rm norm.,pos.}\vert\tilde D_0)$ can be skipped for error minimization with respect to ${h}$. We define the specific prior as
\begin{displaymath}
{p({h}\vert\tilde D_0)} \propto
e^{-E({h}\vert\tilde D_0)}
.
\end{displaymath} (93)

In Eq. (93) the specific prior appears as the posterior of an ${h}$-generating process determined by the parameters $\tilde D_0$. We will therefore call Eq. (93) the posterior form of the specific prior. Alternatively, a specific prior can also be given in likelihood form

\begin{displaymath}
p(\tilde D_0,h\vert{\rm norm.,pos.}) = p(\tilde D_0\vert h)
\,p(h\vert{\rm norm.,pos.})
.
\end{displaymath} (94)

As the likelihood $p(\tilde D_0\vert h)$ is conditioned on ${h}$, the normalization $Z$ = $\int \!d\tilde D_0 \,e^{-E(\tilde D_0\vert{h})}$ remains in general ${h}$-dependent and must be included when minimizing with respect to ${h}$. However, Gaussian specific priors with ${h}$-independent covariances have the special property that, according to Eq. (70), the likelihood and posterior interpretations coincide. Indeed, representing Gaussian specific prior data $\tilde D_0$ by a mean function $t_{\tilde D_0}$ and a covariance ${{\bf K}}^{-1}$ (analogous to standard training data in the case of Gaussian regression, see also Section 3.5), one finds, because the normalization of a Gaussian is independent of its mean (and assuming a uniform (meta) prior $p({h})$),
$\displaystyle p({h}\vert\tilde D_0)$ $\textstyle =$ $\displaystyle \frac{e^{-\frac{1}{2}( {h}-t_{\tilde D_0}, {{\bf K}} ({h}-t_{\tilde D_0}))}}
{\int\!d{h}\,e^{-\frac{1}{2}( {h}-t_{\tilde D_0},{{\bf K}}({h}-t_{\tilde D_0}))}}$ (95)
$\displaystyle =p(t_{\tilde D_0}\vert{h},{{\bf K}})$ $\textstyle =$ $\displaystyle \frac{e^{-\frac{1}{2}( {h}-t_{\tilde D_0}, {{\bf K}} ({h}-t_{\tilde D_0}))}}
{\int\!dt\,e^{-\frac{1}{2}( {h}-t, {{\bf K}} ({h}-t))}}
.$ (96)
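
The step behind the equality of Eqs. (95) and (96), not written out above, is that the Gaussian normalization integral is the same whether one integrates over ${h}$ or over the mean $t$, and is independent of both; formally (with the determinant understood in the functional sense),
\begin{displaymath}
\int\!d{h}\,e^{-\frac{1}{2}( {h}-t_{\tilde D_0}, {{\bf K}} ({h}-t_{\tilde D_0}))}
=\int\!dt\,e^{-\frac{1}{2}( {h}-t, {{\bf K}} ({h}-t))}
=\sqrt{\det \left(2\pi {{\bf K}}^{-1}\right)}
,
\end{displaymath}
so the denominators of Eqs. (95) and (96) coincide and are ${h}$-independent.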

Thus, for a Gaussian $p(t_{\tilde D_0}\vert{h},{{\bf K}})$ with ${h}$-independent normalization, the specific prior energy in likelihood form becomes, analogously to Eq. (93),
\begin{displaymath}
{p(t_{\tilde D_0}\vert{h},{{\bf K}})} \propto
e^{-E(t_{\tilde D_0}\vert{h},{{\bf K}})}
,
\end{displaymath} (97)

and specific prior energies can be interpreted both ways.

Similarly, the complete likelihood factorizes

\begin{displaymath}
p(\tilde D_0,{\rm norm.,pos.}\vert{h})
=
p({\rm norm.,pos.}\vert{h})
\, p(\tilde D_0 \vert {h})
.
\end{displaymath} (98)

According to Eq. (92), the non-negativity and normalization conditions are implemented by step and $\delta$-functions. The non-negativity constraint is only active at locations where $p(y\vert x,h)$ = $0$; in all other cases the gradient has no component pointing into forbidden regions. Due to the combined effect of the data, at which $p(y\vert x,h)$ has to be larger than zero by definition, and of the smoothness terms, the non-negativity condition for $p(y\vert x,{h})$ is usually (but not always) fulfilled automatically. Hence, if strict positivity is checked for the final solution, it is not necessary to include extra non-negativity terms in the error (see Section 3.2.1). For the sake of simplicity we will therefore not include non-negativity terms explicitly in the following. In case a non-negativity constraint has to be included, this can be done using Lagrange multipliers or, alternatively, by writing the step functions in $p({\rm pos.}\vert h) \propto \prod_{x,y} \Theta (p(y\vert x,{h}))$ as

\begin{displaymath}
\Theta(x-a)
= \int_a^\infty \!d\xi \int_{-\infty}^{\infty} \frac{d\eta}{2\pi}\, e^{i\eta(\xi-x)}
,
\end{displaymath} (99)

and solving the $\xi$-integral in saddle point approximation (see, for example, [63,64,65]).

Including the normalization condition in the prior $p_0({h}\vert D_0)$ in the form of a $\delta$-functional results in the posterior probability

\begin{displaymath}
p({h}\vert f) \! \propto
e^{ \sum_i L_i(y_i\vert x_i,{h}) - E({h}\vert\tilde D_0) + \tilde c({H}\vert\tilde D_0)}
\prod_{x\in X} \delta \left( \int\!dy\,e^{L(y\vert x,{h})} -1 \right)
\end{displaymath} (100)

with constant $\tilde c({H}\vert\tilde D_0)$ = $-\ln \tilde Z({h}\vert\tilde D_0)$ related to the normalization of the specific prior $e^{-E({h}\vert\tilde D_0)}$. Writing the $\delta$-functional in its Fourier representation
\begin{displaymath}
\delta (x)
= \frac{1}{2 \pi} \int_{-\infty}^{\infty} \!dk\, e^{i k x}
= \frac{1}{2 \pi i} \int_{-i\infty}^{i\infty} \!dk\, e^{- k x },
\end{displaymath} (101)

i.e.,
\begin{displaymath}
\delta \left( \int \! dy \, e^{L(y\vert x,{h})}-1\right)
= \frac{1}{2 \pi i} \int_{-i\infty}^{i\infty} \!d\Lambda_X (x)\,
e^{ \Lambda_X (x)
\left( 1- \int \! dy \, e^{L(y\vert x,{h})} \right) }
,
\end{displaymath} (102)

and performing a saddle point approximation with respect to $\Lambda_X (x)$ (which is exact in this case) yields
\begin{displaymath}
p({h}\vert f) \propto
e^{ \sum_i L_i(y_i\vert x_i,{h}) - E({h}\vert\tilde D_0)
+ \sum_{x\in X} \Lambda_X (x) \left( 1-\int\!\! dy\, e^{L(y\vert x,{h})} \right)}.
\end{displaymath} (103)

This is equivalent to the Lagrange multiplier approach: the stationary $\Lambda_X (x)$ is the Lagrange multiplier vector (or function) to be determined by the normalization conditions for $p(y\vert x,{h})=e^{L(y\vert x,{h})}$. Besides the Lagrange multiplier terms, it is sometimes numerically useful to add further terms to the log-posterior which vanish for normalized $p(y\vert x,{h})$.
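
A minimal numerical sketch of this Lagrange multiplier treatment is given below; the data, the grid, and the quadratic smoothness energy are purely illustrative assumptions, and the equality constraints (one per $x$-bin) are handled internally by the SLSQP optimizer rather than by an explicit update of $\Lambda_X (x)$:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

n_x, n_y, dy, beta = 3, 25, 0.2, 1.0
data_idx = [(0, 5), (1, 12), (2, 20)]              # hypothetical training pairs (x_i, y_i)

def energy(L_flat):
    L = L_flat.reshape(n_x, n_y)
    data_term = -sum(L[ix, iy] for ix, iy in data_idx)      # -sum_i L(y_i|x_i,h)
    smooth_term = 0.5 * np.sum(np.diff(L, axis=1)**2)       # illustrative specific prior energy
    return data_term + beta * smooth_term

def norm_residuals(L_flat):
    L = L_flat.reshape(n_x, n_y)
    return np.exp(L).sum(axis=1) * dy - 1.0                 # Z_X(x,h) - 1 for every x-bin

L0 = np.full(n_x * n_y, np.log(1.0 / (n_y * dy)))           # start from the uniform density
res = minimize(energy, L0, method="SLSQP",
               constraints=[{"type": "eq", "fun": norm_residuals}])
p_map = np.exp(res.x.reshape(n_x, n_y))                     # normalized MAP estimate of p(y|x,h)
\end{verbatim}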

