

Quadratic density estimation and empirical risk minimization

Interpreting an energy or error functional $E$ probabilistically, i.e., assuming $-\beta E +c$ to be the logarithm of a posterior probability under study, the form of the training data term has to be $-\sum_i \ln P_i$. Technically, however, it would be easier to replace that data term by one which is quadratic in the function $\phi $ of interest.

Indeed, we have mentioned in Section 2.5 that such functionals can be justified within the framework of empirical risk minimization. From that Frequentist point of view an error functional $E(P)$ is not derived from a log-posterior, but represents an empirical risk $\hat r(P,f) = \sum_i l(x_i,y_i,P)$, approximating an expected risk $r(P,f)$ for action $a$ = $P$. This is possible under the assumption that training data are sampled according to the true $p(x,y\vert f)$. In that interpretation one is therefore not restricted to a log-loss but may as well choose for the training data a quadratic loss like

\begin{displaymath}
\frac{1}{2}
\Big( P-P_{\rm emp},\, {{\bf K}_D}\,(P-P_{\rm emp})\Big)
,
\end{displaymath} (234)

choosing a reference density $P_{\rm emp}$ and a real symmetric, positive (semi-)definite ${{\bf K}_D}$.
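
As a simple illustration (not a particular choice made in the text), taking ${{\bf K}_D}$ = ${\bf I}$ reduces (234) to half the squared $L_2$-distance between $P$ and the reference density,

\begin{displaymath}
\frac{1}{2}
\Big( P-P_{\rm emp},\, P-P_{\rm emp}\Big)
=
\frac{1}{2} \int\!dx\,dy\,
\Big( P(x,y)-P_{\rm emp}(x,y)\Big)^2
,
\end{displaymath}

while a non-trivial ${{\bf K}_D}$ weights or correlates the deviations $P-P_{\rm emp}$ at different points $(x,y)$.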

Approximating a joint probability $p(x,y\vert h)$, the reference density $P_{\rm emp}$ would have to be the joint empirical density

\begin{displaymath}
P_{\rm emp}^{\rm joint} (x,y) =
\frac{1}{n} \sum_i^n \delta (x-x_i) \delta(y-y_i)
,
\end{displaymath} (235)

i.e., $P_{\rm emp}^{\rm joint}$ = $N/n$, as obtained from the training data. Approximating conditional probabilities $p(y\vert x,h)$, the reference $P_{\rm emp}$ has to be chosen as the conditional empirical density,
\begin{displaymath}
P_{\rm emp}(x,y)
=
\frac{\sum_i \delta(x-x_i)\delta(y-y_i)}
{\sum_i \delta(x-x_i) }
= \frac{N(x,y)}{n_x}
,
\end{displaymath} (236)

or, defining the diagonal matrix ${\bf N}_X(x,x^\prime,y,y^\prime)$ = $\delta(x-x^\prime)\delta(y-y^\prime) N_X(x)$ = $\delta(x-x^\prime)\delta(y-y^\prime) \sum_i \delta(x-x_i)$
\begin{displaymath}
P_{\rm emp} = {\bf N}_X^{-1}N
.
\end{displaymath} (237)

This, however, is only a valid expression if $N_X(x)\ne0$, meaning that for all $x$ at least one measured value has to be available. For $x$ variables with a large number of possible values, this cannot be assumed. For continuous $x$ variables it is even impossible.
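
As a small worked example (not contained in the original text), consider discrete $x$ and $y$ (Kronecker deltas in place of Dirac $\delta$-functions) and $n$ = $3$ training points $(x_1,y_1)$, $(x_1,y_2)$, $(x_2,y_1)$ with $x_1\ne x_2$ and $y_1\ne y_2$. Then $N_X(x_1)$ = $2$, $N_X(x_2)$ = $1$, and (236) gives

\begin{displaymath}
P_{\rm emp}(x_1,y_1) = P_{\rm emp}(x_1,y_2) = \frac{1}{2}
,\qquad
P_{\rm emp}(x_2,y_1) = 1
,
\end{displaymath}

while $P_{\rm emp}(x,y)$ remains undefined for every $x$ not appearing in the training data. This is exactly the problem addressed in the following.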

Hence, when approximating conditional empirical densities, either non-data $x$-values must be excluded from the integration in (234) by using an operator ${\bf K}_D$ containing the projector $\sum_{x^\prime\in x_D} \delta(x-x^\prime)$, or $P_{\rm emp}$ must also be defined for such non-data $x$-values. For existing $V_X$ = ${\bf I}_X 1$ = $\int \! dy\, 1$, a possible extension $\tilde P_{\rm emp}$ of $P_{\rm emp}$ would be to assume a uniform density for non-data $x$-values, yielding

\begin{displaymath}
\tilde P_{\rm emp}(x,y)
=
\left\{
\begin{array}{ll}
\displaystyle
\frac{\sum_i \delta(x-x_i)\delta(y-y_i)}{\sum_i \delta(x-x_i)}
&\qquad {\rm for} \quad \sum_i \delta(x-x_i) \ne 0 ,
\\[2ex]
\displaystyle
\frac{1}{V_X}
&\qquad {\rm for} \quad \sum_i \delta(x-x_i) = 0 .
\end{array}
\right.
\end{displaymath} (238)

This introduces a bias towards uniform probabilities, but has the advantage of giving an empirical density for all $x$ and of fulfilling the conditional normalization requirements.
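
To make the projector alternative mentioned above explicit (one possible form, written out here only for illustration), the operator ${\bf K}_D$ could be chosen as

\begin{displaymath}
{\bf K}_D(x,y;x^\prime,y^\prime)
=
\delta(x-x^\prime)\,\delta(y-y^\prime)
\sum_{x^{\prime\prime}\in x_D} \delta(x-x^{\prime\prime})
,
\end{displaymath}

so that the data term (234) only receives contributions from $x$-values contained in the training data, and $P_{\rm emp}$ never has to be evaluated at non-data $x$-values.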

Instead of a quadratic term in $P$, one might consider a quadratic term in the log-probability $L$. The log-probability, however, is minus infinity at all non-data points $(x,y)\not\in D$. To work with a finite expression, one can choose a small $\epsilon (y)$ and approximate $P_{\rm emp}$ by

\begin{displaymath}
P^\epsilon_{\rm emp} (x,y) =
\frac{\epsilon(y) + \sum_i \delta(x-x_i)\delta(y-y_i)}
{\int\!dy \, \epsilon (y) + \sum_i \delta(x-x_i)}
,
\end{displaymath} (239)

provided $\int\!dy \, \epsilon (y)$ exists. For $\epsilon (y)\ne 0$ also $P^\epsilon_{\rm emp} (x,y)\ne 0$ for all $x$, so that $L^\epsilon_{\rm emp}$ = $\ln P^\epsilon_{\rm emp}>-\infty$ exists.
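
For instance (a special case, assumed here only for illustration), a constant $\epsilon(y)$ = $\epsilon$ on a $y$-domain of finite volume $V_X$ = $\int\!dy\,1$ gives

\begin{displaymath}
P^\epsilon_{\rm emp} (x,y)
=
\frac{\epsilon + N(x,y)}{\epsilon V_X + N_X(x)}
,
\end{displaymath}

which interpolates between the uniform density $1/V_X$ at non-data $x$-values, i.e., for $N_X(x)$ = $0$, and the conditional empirical density $N(x,y)/N_X(x)$ in the limit $\epsilon\rightarrow 0$.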

A quadratic data term in $P$ results in an error functional

\begin{displaymath}
\tilde E_P =
\frac{1}{2} \Big(P-P_{\rm emp},\, {{\bf K}_D}\,(P-P_{\rm emp})\Big)
+ \frac{1}{2} (P,\, {{\bf K}}\,P)
+(P,\, \Lambda_X ),
\end{displaymath} (240)

skipping the constant part of the $\Lambda_X$-terms. In (240) the empirical density $P_{\rm emp}$ may be replaced by $\tilde P_{\rm emp}$ of (238).
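
For completeness (a straightforward step, not written out in the text at this point), setting the functional derivative of (240) with respect to $P$ to zero gives the linear stationarity equation

\begin{displaymath}
0 =
{{\bf K}_D}\,(P-P_{\rm emp})
+ {{\bf K}}\,P
+ \Lambda_X
,
\end{displaymath}

so that $P$ = $({{\bf K}_D}+{{\bf K}})^{-1}({{\bf K}_D} P_{\rm emp} - \Lambda_X)$, provided the inverse exists, with the Lagrange multiplier function $\Lambda_X$ determined by the normalization condition for $P$.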

Positive (semi-)definite operators ${{\bf K}_D}$ have a square root and can be written in the form ${\bf R}^T{\bf R}$. One possibility, skipping for the sake of simplicity the variable $x$ in the following, is to choose as square root ${\bf R}$ the integration operator, i.e., ${\bf R}$ = $\bigotimes_k {\bf R}_k$ and ${\bf R} (y,y^\prime)$ = $\theta (y-y^\prime )$. Thus, $\phi ={\bf R} P$ transforms the density function $P$ into the distribution function $\phi $, and we have $P = P(\phi) = {\bf R}^{-1} \phi$. Here the inverse ${\bf R}^{-1}$ is the differentiation operator $\prod_k \nabla_{y_k}$ (with appropriate boundary conditions) and $\left({\bf R}^T\right)^{-1}{\bf R}^{-1}$ = $-\prod_k \Delta_k$ is the product of one-dimensional Laplacians $\Delta_k = \partial^2 /\partial y_k^2$. Adding, for example, a regularizing term $\frac{\lambda}{2}(P,\,P)$ as in (165) gives

\begin{displaymath}
\tilde E_P =
\frac{1}{2} \Big(\,P-P_{\rm emp}\,,\, {\bf R}^T{\bf R}\,(P-P_{\rm emp})\,\Big)
+\frac{\lambda}{2} (P,\,P)
\end{displaymath} (241)


\begin{displaymath}
=
\frac{1}{2} \left(
\Big(\phi-\phi_{\rm emp},\, \phi-\phi_{\rm emp}\Big)
- \lambda \Big(\phi,\,\prod_k \Delta_k \,\phi\,\Big)
\right)
\end{displaymath} (242)


\begin{displaymath}
=
\frac{1}{2 m^2} \Big(\phi,\, (- \prod_k \Delta_k + m^2 {\bf I})\,\phi\Big)
- (\phi,\, \phi_{\rm emp})
+ \frac{1}{2} (\phi_{\rm emp},\, \phi_{\rm emp}),
\end{displaymath} (243)

with $m^2=\lambda^{-1}$. Here the empirical distribution function $\phi_{\rm emp}$ = ${\bf R} P_{\rm emp}$ is given by $\phi_{\rm emp} (y)$ = $\frac{1}{n}\sum_i \theta (y-y_i)$ (or, including the $x$ variable, $\phi_{\rm emp} (x,y)$ = $\frac{1}{N_X(x)}\sum_i \delta(x-x_i)\,\theta (y-y_i)$ for $N_X(x)\ne0$, which could be extended to a linear $\tilde \phi$ = ${\bf R} \tilde P_{\rm emp}$ for $N_X(x)$ = $0$). The stationarity equation yields
\begin{displaymath}
\phi =
m^2 \left( - \prod_k \Delta_k + m^2 {\bf I} \right)^{-1} \phi_{\rm emp}.
\end{displaymath} (244)

For $d_y$ = $1$ (or $\phi = \prod_k \phi_k$) the operator becomes $\left( - \Delta + m^2{\bf I}\right)^{-1}$, which has the structure of a free massive propagator for a scalar field with mass $m$ and is calculated in Section 7.2.3. As already mentioned, the normalization and non-negativity conditions for $P$ appear for $\phi $ as boundary and monotonicity conditions. For non-constant $P$ the monotonicity condition does not have to be implemented by Lagrange multipliers, as the gradient at the stationary point has no components pointing into the forbidden area. (The conditions nevertheless have to be checked.) Kernel methods of density estimation, like the use of Parzen windows, can be founded on such quadratic regularization functionals [224]. Indeed, the one-dimensional Eq. (244) is equivalent to the use of Parzen's kernel in density estimation [182,169].
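
To make this kernel interpretation concrete, the following short numerical sketch is added here (it is not part of the original text; it assumes a one-dimensional $y$ on a finite grid, a hypothetical data sample, and uses the explicit kernel $\frac{m}{2}e^{-m\vert y-y^\prime\vert}$ of $m^2(-\Delta+m^2{\bf I})^{-1}$ on the infinite interval). It solves the discretized stationarity equation (244), recovers $P$ = $\nabla_y\phi$, and compares it with the corresponding Parzen-window estimate $P(y)$ = $\frac{1}{n}\sum_i \frac{m}{2}e^{-m\vert y-y_i\vert}$; away from the grid boundaries the two agree up to discretization error.

\begin{verbatim}
import numpy as np

# Grid for the one-dimensional y variable (finite interval).
y = np.linspace(-5.0, 5.0, 1001)
h = y[1] - y[0]

# Hypothetical training sample y_i and mass m, with m^2 = 1/lambda.
y_data = np.array([-1.0, 0.3, 0.8, 2.0])
n = len(y_data)
m = 1.5

# Empirical distribution function phi_emp(y) = (1/n) sum_i theta(y - y_i).
phi_emp = np.mean(y[None, :] >= y_data[:, None], axis=0)

# Finite-difference Laplacian for the operator (-Delta + m^2 I).
N_grid = len(y)
lap = (np.diag(-2.0 * np.ones(N_grid))
       + np.diag(np.ones(N_grid - 1), 1)
       + np.diag(np.ones(N_grid - 1), -1)) / h**2
# Pin phi to phi_emp at the endpoints (boundary values phi = 0 and phi = 1).
lap[0, :] = 0.0
lap[-1, :] = 0.0
A = -lap + m**2 * np.eye(N_grid)

# Stationarity equation (244): phi = m^2 (-Delta + m^2 I)^{-1} phi_emp.
phi = np.linalg.solve(A, m**2 * phi_emp)

# Density P = d(phi)/dy, compared with the exponential-kernel
# (Parzen-window) estimate (1/n) sum_i (m/2) exp(-m |y - y_i|).
P = np.gradient(phi, y)
P_parzen = np.mean(
    0.5 * m * np.exp(-m * np.abs(y[None, :] - y_data[:, None])), axis=0)

print("max |P - P_parzen| away from the boundaries:",
      np.max(np.abs(P - P_parzen)[50:-50]))
\end{verbatim}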

