
Normalization, non-negativity, and specific priors

Density estimation problems are characterized by the normalization and non-negativity conditions for $p(y\vert x,{h})$. Thus, the prior density $p({h}\vert D_0)$ can only be non-zero for those ${h}$ for which $p(y\vert x,{h})$ is positive and normalized over $y$ for all $x$. (Similarly, when solving for a distribution function, i.e., the integral of a density, the non-negativity constraint is replaced by monotonicity and the normalization constraint by the requirement that the distribution function be 1 on the right boundary.) While the non-negativity constraint is local with respect to $x$ and $y$, the normalization constraint is nonlocal with respect to $y$. Thus, implementing a normalization constraint leads to nonlocal and in general non-Gaussian priors.

For classification problems, which have discrete $y$ values (classes), the normalization constraint simply requires a sum over the different classes, and a Gaussian prior structure with respect to the $x$-dependency is not altered [236]. For general density estimation problems, however, i.e., for continuous $y$, the loss of the Gaussian structure with respect to $y$ is more severe, because non-Gaussian functional integrals can in general not be performed analytically. On the other hand, when the learning problem is solved numerically by discretizing the $y$ and $x$ variables, the normalization term is typically not a severe complication.
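
To illustrate the last point, the following minimal Python sketch (all grid sizes, variable names, and the random log-density are hypothetical illustrations, not taken from the text) shows that after discretization the normalization of $p(y\vert x,{h})$ reduces to a simple sum over $y$-bins for each $x$:

\begin{verbatim}
import numpy as np

# Hypothetical grid: n_x bins in x, n_y bins in y of width dy.
n_x, n_y, dy = 20, 50, 0.1
L = np.random.randn(n_x, n_y)        # stands for a log-density L(y|x,h) on the grid
p = np.exp(L)                        # p(y|x,h) = e^L is positive by construction
Z_X = p.sum(axis=1) * dy             # discretized Z_X(x,h) = int dy p(y|x,h), one value per x
p = p / Z_X[:, None]                 # rescale each x-row so that it integrates to one
assert np.allclose(p.sum(axis=1) * dy, 1.0)
\end{verbatim}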

To be specific, consider a Maximum A Posteriori Approximation, minimizing

\begin{displaymath}
\beta E_{\rm comb}
=-\sum_i L(y_i\vert x_i,{h}) + \beta E({h}\vert D_0)
,
\end{displaymath} (87)

where the likelihood free energy $F(Y_D\vert x_D,{h})$ is included, but not the prior free energy $F({H}\vert D_0)$ which, being ${h}$-independent, is irrelevant for minimization with respect to $h$. The prior energy $\beta E({h}\vert D_0)$ has to implement the non-negativity and normalization conditions
$\displaystyle Z_X(x,{h})=
\int \!dy\, p(y\vert x,{h}) = 1,$   $\displaystyle \forall x \in X, \forall {h}\in {H},$ (88)
$\displaystyle p(y\vert x,{h}) \ge 0,$   $\displaystyle \forall y \in Y,
\forall x \in X,
\forall {h}\in {H}
.$ (89)
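
As an illustration only (the quadratic smoothness energy and all names below are assumptions made for this sketch, not the prior discussed in the text), the MAP energy of Eq. (87) and the conditions (88) and (89) take the following form on such a discretized grid:

\begin{verbatim}
import numpy as np

def combined_energy(L, data_idx, beta, dy):
    # L: array of shape (n_x, n_y) holding the log-density L(y|x,h);
    # data_idx: list of (x_bin, y_bin) pairs representing the training data (x_i, y_i).
    likelihood_term = -sum(L[ix, iy] for ix, iy in data_idx)   # -sum_i L(y_i|x_i,h)
    prior_energy = 0.5 * np.sum(np.diff(L, axis=1)**2)         # illustrative smoothness E(h|D_0)
    return likelihood_term + beta * prior_energy               # beta * E_comb of Eq. (87)

def satisfies_constraints(L, dy, tol=1e-6):
    p = np.exp(L)                                              # e^L >= 0, so Eq. (89) holds automatically
    return np.allclose(p.sum(axis=1) * dy, 1.0, atol=tol)      # Eq. (88), checked per x-bin
\end{verbatim}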

It is useful to isolate the normalization condition and non-negativity constraint defining the class of density estimation problems from the rest of the problem specific priors. Introducing the specific prior information $\tilde D_0$ so that $D_0$ = $\{ \tilde D_0, {\rm normalized,positive} \}$, we have
\begin{displaymath}
p({h}\vert\tilde D_0,{\rm norm.,pos.})
=
\frac{p({\rm norm.,pos.}\vert{h},\tilde D_0)\,p({h}\vert\tilde D_0)}{p({\rm norm.,pos.}\vert\tilde D_0)}
,
\end{displaymath} (90)

with deterministic, $\tilde D_0$-independent
\begin{displaymath}
p({\rm norm.,pos.}\vert{h})
= p({\rm norm.,pos.}\vert{h}, \tilde D_0)
\end{displaymath} (91)


\begin{displaymath}
=p({\rm norm.}\vert{h}) p({\rm pos.}\vert{h})
= \delta (Z_X - 1) \prod_{xy}\Theta \Big(p(y\vert x,h)\Big)
,
\end{displaymath} (92)

and step function $\Theta$. (The density $p({\rm norm.}\vert{h})$ is normalized over all possible normalizations of $p(y\vert x,h)$, i.e., over all possible values of $Z_X$, and $p({\rm pos.}\vert{h})$ over all possible sign combinations.) The ${h}$-independent denominator $p({\rm norm.,pos.}\vert\tilde D_0)$ can be skipped for error minimization with respect to ${h}$. We define the specific prior as
\begin{displaymath}
{p({h}\vert\tilde D_0)} \propto
e^{-E({h}\vert\tilde D_0)}
.
\end{displaymath} (93)

In Eq. (93) the specific prior appears as the posterior of an ${h}$-generating process determined by the parameters $\tilde D_0$. We will therefore call Eq. (93) the posterior form of the specific prior. Alternatively, a specific prior can also be given in likelihood form

\begin{displaymath}
p(\tilde D_0,h\vert{\rm norm.,pos.}) = p(\tilde D_0\vert h)
\,p(h\vert{\rm norm.,pos.})
.
\end{displaymath} (94)

As the likelihood $p(\tilde D_0\vert h)$ is conditioned on ${h}$, the normalization $Z$ = $\int \!d\tilde D_0 \,e^{-E(\tilde D_0\vert{h})}$ remains in general ${h}$-dependent and must be included when minimizing with respect to ${h}$. However, Gaussian specific priors with ${h}$-independent covariances have the special property that, according to Eq. (70), the likelihood and posterior interpretations coincide. Indeed, representing Gaussian specific prior data $\tilde D_0$ by a mean function $t_{\tilde D_0}$ and a covariance ${{\bf K}}^{-1}$ (analogous to standard training data in the case of Gaussian regression, see also Section 3.5), one finds, because the normalization of a Gaussian is independent of its mean (and assuming a uniform (meta) prior $p({h})$),
$\displaystyle p({h}\vert\tilde D_0)$ $\textstyle =$ $\displaystyle \frac{e^{-\frac{1}{2}( {h}-t_{\tilde D_0}, {{\bf K}} ({h}-t_{\tilde D_0}))}}
{\int\!d{h}\,e^{-\frac{1}{2}( {h}-t_{\tilde D_0},{{\bf K}}({h}-t_{\tilde D_0}))}}$ (95)
$\displaystyle =p(t_{\tilde D_0}\vert{h},{{\bf K}})$ $\textstyle =$ $\displaystyle \frac{e^{-\frac{1}{2}( {h}-t_{\tilde D_0}, {{\bf K}} ({h}-t_{\tilde D_0}))}}
{\int\!dt\,e^{-\frac{1}{2}( {h}-t, {{\bf K}} ({h}-t))}}
.$ (96)
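
The step behind the equality of Eqs. (95) and (96), not written out above, is that the Gaussian normalization integral is the same whether one integrates over ${h}$ or over the mean $t$, and is independent of both; formally (with the determinant understood in the functional sense),
\begin{displaymath}
\int\!d{h}\,e^{-\frac{1}{2}( {h}-t_{\tilde D_0}, {{\bf K}} ({h}-t_{\tilde D_0}))}
=\int\!dt\,e^{-\frac{1}{2}( {h}-t, {{\bf K}} ({h}-t))}
=\sqrt{\det \left(2\pi {{\bf K}}^{-1}\right)}
,
\end{displaymath}
so the denominators of Eqs. (95) and (96) coincide and are ${h}$-independent.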

Thus, for a Gaussian $p(t_{\tilde D_0}\vert{h},{{\bf K}})$ with ${h}$-independent normalization, the specific prior energy in likelihood form becomes, analogously to Eq. (93),
\begin{displaymath}
{p(t_{\tilde D_0}\vert{h},{{\bf K}})} \propto
e^{-E(t_{\tilde D_0}\vert{h},{{\bf K}})}
,
\end{displaymath} (97)

and specific prior energies can be interpreted both ways.

Similarly, the complete likelihood factorizes

\begin{displaymath}
p(\tilde D_0,{\rm norm.,pos.}\vert{h})
=
p({\rm norm.,pos.}\vert{h})
\, p(\tilde D_0 \vert {h})
.
\end{displaymath} (98)

According to Eq. (92), the non-negativity and normalization conditions are implemented by step and $\delta$-functions. The non-negativity constraint is only active at locations where $p(y\vert x,h)$ = $0$; in all other cases the gradient has no component pointing into forbidden regions. Due to the combined effect of the data, at which $p(y\vert x,h)$ has to be larger than zero by definition, and of the smoothness terms, the non-negativity condition for $p(y\vert x,{h})$ is usually (but not always) fulfilled automatically. Hence, if strict positivity is checked for the final solution, it is not necessary to include extra non-negativity terms in the error (see Section 3.2.1). For the sake of simplicity we will therefore not include non-negativity terms explicitly in the following. In case a non-negativity constraint has to be included, this can be done using Lagrange multipliers or, alternatively, by writing the step functions in $p({\rm pos.}\vert h) \propto \prod_{x,y} \Theta (p(y\vert x,{h}))$ as

\begin{displaymath}
\Theta(x-a)
= \int_a^\infty \!d\xi \int_{-\infty}^{\infty} \frac{d\eta}{2\pi}\, e^{i\eta(\xi-x)}
,
\end{displaymath} (99)

and solving the $\xi$-integral in saddle point approximation (see, for example, [63,64,65]).

Including the normalization condition in the prior $p_0({h}\vert D_0)$ in the form of a $\delta$-functional results in the posterior probability

\begin{displaymath}
p({h}\vert f) \! \propto
e^{ \sum_i L_i(y_i\vert x_i,{h}) - E({h}\vert\tilde D_0) + \tilde c({H}\vert\tilde D_0)}
\prod_{x\in X} \delta \left( \int\!dy\,e^{L(y\vert x,{h})} -1 \right)
\end{displaymath} (100)

with constant $\tilde c({H}\vert\tilde D_0)$ = $-\ln \tilde Z({h}\vert\tilde D_0)$ related to the normalization of the specific prior $e^{-E({h}\vert\tilde D_0)}$. Writing the $\delta$-functional in its Fourier representation
\begin{displaymath}
\delta (x)
= \frac{1}{2 \pi} \int_{-\infty}^{\infty} \!dk\, e^{i k x}
= \frac{1}{2 \pi i} \int_{-i\infty}^{i\infty} \!dk\, e^{- k x },
\end{displaymath} (101)

i.e.,
\begin{displaymath}
\delta \left( \int \! dy \, e^{L(y\vert x,{h})}-1\right)
= \frac{1}{2 \pi i} \int_{-i\infty}^{i\infty} \!d\Lambda_X (x)\,
e^{ \Lambda_X (x)
\left( 1- \int \! dy \, e^{L(y\vert x,{h})} \right) }
,
\end{displaymath} (102)

and performing a saddle point approximation with respect to $\Lambda_X (x)$ (which is exact in this case) yields
\begin{displaymath}
p({h}\vert f) \propto
e^{ \sum_i L_i(y_i\vert x_i,{h}) - E({h}\vert\tilde D_0)
+ \sum_{x\in X} \Lambda_X (x) \left( 1-\int\!\! dy\, e^{L(y\vert x,{h})} \right)}.
\end{displaymath} (103)

This is equivalent to the Lagrange multiplier approach: the stationary $\Lambda_X (x)$ is the Lagrange multiplier vector (or function) to be determined by the normalization conditions for $p(y\vert x,{h})=e^{L(y\vert x,{h})}$. Besides the Lagrange multiplier terms, it is sometimes numerically useful to add further terms to the log-posterior which vanish for normalized $p(y\vert x,{h})$.
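
A minimal numerical sketch of this Lagrange multiplier treatment is given below; the data, the grid, and the quadratic smoothness energy are purely illustrative assumptions, and the equality constraints (one per $x$-bin) are handled internally by the SLSQP optimizer rather than by an explicit update of $\Lambda_X (x)$:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

n_x, n_y, dy, beta = 3, 25, 0.2, 1.0
data_idx = [(0, 5), (1, 12), (2, 20)]              # hypothetical training pairs (x_i, y_i)

def energy(L_flat):
    L = L_flat.reshape(n_x, n_y)
    data_term = -sum(L[ix, iy] for ix, iy in data_idx)      # -sum_i L(y_i|x_i,h)
    smooth_term = 0.5 * np.sum(np.diff(L, axis=1)**2)       # illustrative specific prior energy
    return data_term + beta * smooth_term

def norm_residuals(L_flat):
    L = L_flat.reshape(n_x, n_y)
    return np.exp(L).sum(axis=1) * dy - 1.0                 # Z_X(x,h) - 1 for every x-bin

L0 = np.full(n_x * n_y, np.log(1.0 / (n_y * dy)))           # start from the uniform density
res = minimize(energy, L0, method="SLSQP",
               constraints=[{"type": "eq", "fun": norm_residuals}])
p_map = np.exp(res.x.reshape(n_x, n_y))                     # normalized MAP estimate of p(y|x,h)
\end{verbatim}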

