An important special case of density estimation
leading to quadratic data terms
is regression for independent training data
with Gaussian likelihoods

$p(y_i|x_i,h) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\Big(-\frac{(y_i - h(x_i))^2}{2\sigma^2}\Big)$.   (245)

We have remarked in Section 2.3 that for continuous $x$ a measurement of $h(x)$ has to be understood as a measurement of an averaged $\tilde h(x) = \int\! dx'\, \vartheta(x')\, h(x')$ for sharply peaked $\vartheta(x')$. We assume here that the discretization of $x$ used in numerical calculations takes care of that averaging. Divergent quantities, like $\delta$-functionals, which are used here for convenience, will then not be present.
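To make the quadratic data terms concrete: for the Gaussian likelihood (245), the negative log-likelihood equals the squared-error term up to an $h$-independent constant. The following sketch (the toy data and the candidate function are illustrative assumptions, not taken from the text) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.uniform(0, 1, size=10)                            # training inputs x_i
y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=10)   # noisy targets y_i

def neg_log_likelihood(h_vals, y, sigma):
    """-log p(y|x,h) for independent Gaussian likelihoods, Eq. (245)."""
    return np.sum(0.5 * (y - h_vals) ** 2 / sigma**2
                  + 0.5 * np.log(2 * np.pi * sigma**2))

def squared_error(h_vals, y, sigma):
    """Quadratic data term: the h-dependent part of -log p(y|x,h)."""
    return 0.5 * np.sum((y - h_vals) ** 2) / sigma**2

h_vals = np.sin(2 * np.pi * x)     # some candidate h, evaluated at the x_i
const = 0.5 * len(y) * np.log(2 * np.pi * sigma**2)  # h-independent constant
assert np.isclose(neg_log_likelihood(h_vals, y, sigma),
                  squared_error(h_vals, y, sigma) + const)
```

Maximizing the Gaussian likelihood is therefore the same as minimizing the quadratic data term.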

We now combine Gaussian data terms and a Gaussian (specific) prior
with prior operator $K_0$
and define, for training data $(x_i, y_i)$, the operator

$K = K_D + K_0$

with

$t = K^{-1}\big(K_D\, t_D + K_0\, t_0\big)$   (253)

and $h$-independent minimal errors,

$E_{\min} = \tfrac{1}{2}\,\big\langle\, t_D - t_0,\; \big(\sigma_D^2 + \sigma_0^2\big)^{-1}\,(t_D - t_0)\,\big\rangle$,   (255), (256)

written in terms of the ``generalized variances'' $\sigma_D^2 = K_D^{-1}$ and $\sigma_0^2 = K_0^{-1}$. The scalar product stands for $x$-integration only; for the sake of simplicity, however, we will skip the subscript $x$ in the following. The data operator reads

$K_D(x,x') = \delta(x - x')\, \frac{1}{\sigma^2} \sum_{i=1}^{n} \delta(x - x_i)$   (257)

$\phantom{K_D(x,x')} = \delta(x - x')\, \frac{n_x}{\sigma^2}$,   (258)

with the number of measurements at $x$,

$n_x = \sum_j n_j\, \delta(x - x_j)$,   (259)

and the projector on the space of training data,

$P_D(x,x') = \delta(x - x') \sum_j \delta(x - x_j)$.   (260)

Notice that the sum is not over all training points but only over the different $x_j$. (Again, for continuous $x$ an integration around $x_j$ is required to ensure $P_D^2 = P_D$.) Hence, the data template becomes the mean of $y$-values measured at $x_j$,

$t_D(x_j) = \frac{1}{n_j} \sum_{i:\, x_i = x_j} y_i$.   (261)
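The multiplicities $n_j$ and the data template of Eq. (261) are easy to compute in practice; the sketch below uses made-up toy data (all numbers illustrative):

```python
import numpy as np

# Toy training data with repeated x-values (illustrative numbers).
x = np.array([0.1, 0.1, 0.4, 0.4, 0.4, 0.9])
y = np.array([1.0, 1.2, 2.0, 2.2, 2.4, 0.5])

# Distinct inputs x_j, multiplicities n_j, and data template t_D(x_j):
# the mean of all y-values measured at the same x_j, Eq. (261).
x_distinct, inverse, n_j = np.unique(x, return_inverse=True, return_counts=True)
t_D = np.bincount(inverse, weights=y) / n_j

print(x_distinct)  # [0.1 0.4 0.9]
print(n_j)         # [2 3 1]
print(t_D)         # [1.1 2.2 0.5]
```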

The stationarity equation is most easily obtained from (252),

$0 = K_D\,(h - t_D) + K_0\,(h - t_0), \qquad\text{i.e.}\quad h = t = K^{-1}\big(K_D\, t_D + K_0\, t_0\big)$.   (262)

We remark that $K$ can be invertible (and usually is, so the learning problem is well defined) even if $K_0$ is not invertible. The inverse $K^{-1}$, necessary to calculate $h$, is training-data dependent and represents the covariance operator/matrix of a Gaussian posterior process. In many practical cases, however, the prior covariance $K_0^{-1}$ (or, in case of a null space, a pseudo inverse of $K_0$) is directly given or can be calculated. Then an inversion of a finite-dimensional matrix in data space is sufficient to find the minimum of the energy [228,76].
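This data-space inversion is the familiar Gaussian-process regression computation. The sketch below (an RBF prior covariance $C = K_0^{-1}$, zero prior template $t_0$, and toy data are all illustrative assumptions, not taken from the text) computes the minimizer $h(x) = t_0(x) + \sum_j C(x,x_j)\,b_j$ with $b = (C_{DD} + \sigma^2 I)^{-1}(y - t_0)$, inverting only an $n\times n$ matrix:

```python
import numpy as np

def C(a, b, length=0.3):
    """Prior covariance K_0^{-1}, here an RBF kernel (an assumption)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

rng = np.random.default_rng(1)
sigma2 = 0.05                                   # noise variance sigma^2
X = rng.uniform(0, 1, 8)                        # training inputs
y = np.sin(2 * np.pi * X) + np.sqrt(sigma2) * rng.normal(size=8)
t0 = lambda x: np.zeros_like(x)                 # prior template t_0 = 0

# Only the n x n matrix C_DD + sigma^2 I in data space has to be inverted.
A = C(X, X) + sigma2 * np.eye(len(X))
b = np.linalg.solve(A, y - t0(X))               # data-space coefficients

def h(x):
    """Posterior mean h(x) = t_0(x) + sum_j C(x, x_j) b_j."""
    return t0(x) + C(x, X) @ b

# The fit reproduces the data up to the noise level:
print(np.max(np.abs(h(X) - y)))
```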

**Invertible $K_0$**:
Let us first study the case of an invertible $K_0$
and consider the stationarity equation
as obtained from (248) or (250),

$K_0\,(h - t_0) = K_D\,(t_D - h)$.   (263)

For existing $K_0^{-1}$ one can solve Eq. (263) for $h$,

$h = t_0 + K_0^{-1} K_D\,(t_D - h)$,   (266)

and introduce the auxiliary variable $a = K_D\,(t_D - h)$ of Eq. (267)

to obtain

$h = t_0 + K_0^{-1}\, a$.   (268)

Inserting Eq. (268) into Eq. (267) one finds an equation for $a$,

$a = K_D\,\big(t_D - t_0 - K_0^{-1}\, a\big)$.   (269)

Multiplying Eq. (269) from the left by the projector $P_D$ defined in (260) and using

$P_D\, a = a, \qquad K_D\, P_D = P_D\, K_D = K_D$,   (270)

one obtains an equation acting in the space of training data only,

$a = K_D\,(t_D - t_0) - K_D\, K_0^{-1}\, a$,   (272)

where only the values of $a$ at the (distinct) training points are unknown. Solving for $a$,

$a = \big(I + K_D\, K_0^{-1}\big)^{-1} K_D\,(t_D - t_0)$,   (274)

and inserting into Eq. (268) yields

$h = t_0 + K_0^{-1}\big(I + K_D\, K_0^{-1}\big)^{-1} K_D\,(t_D - t_0)$   (275)

$\phantom{h} = t_0 + \big(K_0 + K_D\big)^{-1} K_D\,(t_D - t_0)$,   (276)

or, in components, with $C = K_0^{-1}$ and $W = \mathrm{diag}(n_j/\sigma^2)$,

$h(x) = t_0(x) + \sum_j C(x, x_j)\, b_j, \qquad b = \big(C_{DD} + W^{-1}\big)^{-1}\big(t_D - t_0\big)\big|_D$,   (277)

so that only the finite-dimensional matrix $C_{DD} + W^{-1}$ in data space has to be inverted.

Eq. (277) can also be obtained directly from Eq. (263) and the definitions (254), without introducing the auxiliary variable $a$, using the decomposition $h = P_D\, h + (I - P_D)\, h$ and

$K_D\,(t_D - h) = W\,(t_D - h_D)$.   (278)

The operator

$\hat K = C_{\cdot D}\,\big(C_{DD} + W^{-1}\big)^{-1} P_D$,   (279)

which maps $t_D - t_0$ to the solution $h - t_0$, is also known as equivalent kernel due to its relation to kernel smoothing techniques [210,94,90,76].
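The kernel-smoothing connection can be made explicit: the prediction at any $x$ is a weighted sum of the training targets, with $x$-dependent weights given by the equivalent kernel. In the sketch below (RBF prior covariance, zero template $t_0$, one measurement per point, toy data — all assumptions for illustration):

```python
import numpy as np

def C(a, b, length=0.25):
    """Assumed prior covariance K_0^{-1} (RBF kernel)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

X = np.linspace(0, 1, 9)        # training inputs (one measurement each)
y = np.cos(3 * X)               # training targets (t_0 = 0 assumed)
sigma2 = 0.1                    # noise variance, so W^{-1} = sigma^2 I

# Equivalent-kernel weights: w(x) = C(x, X) (C_DD + sigma^2 I)^{-1}
M = np.linalg.inv(C(X, X) + sigma2 * np.eye(len(X)))

def weights(x):
    return C(np.atleast_1d(x), X) @ M

x_star = np.array([0.37])
w = weights(x_star)             # one weight per training point
h_star = w @ y                  # prediction = weighted sum of the data
print(w.shape, h_star)
```

The weights depend only on the inputs and the prior, not on the targets, which is the defining property of a linear smoother.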

Interestingly, Eq. (268) still holds for non-quadratic data terms of the form $g_D(h_D)$ with any differentiable function fulfilling $g_D(h) = g_D(h_D)$, where $h_D = P_D\, h$ is the restriction of $h$ to data space. Hence, also the vector of functional derivatives with respect to $h$ is restricted to data space, i.e., $a = -\,\delta g_D/\delta h$ with $P_D\, a = a$ and $h = t_0 + K_0^{-1} a$. For example, $g_D(h_D) = \sum_i \psi\big(h(x_i) - y_i\big)$ with a differentiable function $\psi$. The finite-dimensional vector of values of $a$ at the training points is then found by solving a nonlinear equation instead of a linear one [73,75].
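A minimal sketch of such a nonlinear data-space problem (the pseudo-Huber loss, the RBF prior covariance $C = K_0^{-1}$, zero template $t_0$, the toy data with an outlier, and the damped fixed-point scheme are all illustrative assumptions): the stationarity condition in data space reads $h_D = -\,C_{DD}\,\psi'(h_D - y)$ and is solved iteratively.

```python
import numpy as np

def C(a, b, length=0.3):
    """Assumed prior covariance K_0^{-1} (RBF kernel)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

# Derivative of the pseudo-Huber loss psi(r) = sqrt(1 + r^2) - 1.
psi_prime = lambda r: r / np.sqrt(1.0 + r**2)

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 7)
y = np.sin(2 * np.pi * X)
y[0] += 5.0                      # an outlier the robust loss downweights

CDD = C(X, X)
hD = y.copy()                    # initial guess for h at the data points
for _ in range(500):             # damped fixed-point iteration
    hD_new = -CDD @ psi_prime(hD - y)      # t_0 = 0 assumed
    hD = 0.9 * hD + 0.1 * hD_new

# Residual of the nonlinear data-space equation h_D = -C psi'(h_D - y):
print(np.max(np.abs(hD + CDD @ psi_prime(hD - y))))
```

In practice one would use a Newton-type scheme; the damped iteration is just the simplest scheme that converges here.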

Furthermore, one can study vector fields, i.e., the case where,
besides possibly $y$,
also $h$, and thus $t_0$, is a vector for given $x$.
(Considering the variable indicating the vector components of $h$
as part of the $x$-variable,
this is a situation where a fixed number of one-dimensional $y$,
corresponding to a subspace of $X$ with fixed dimension,
is always measured simultaneously.)
In that case the diagonal $K_i$ of Eq. (246)
can be replaced by a version with non-zero off-diagonal elements
between the vector components of $h$.
This corresponds
to a multi-dimensional Gaussian data generating probability

$p(y_i|x_i,h) = \det\!\Big(\frac{K_i}{2\pi}\Big)^{1/2} \exp\!\Big(-\tfrac{1}{2}\,\big(y_i - h(x_i)\big)^{T} K_i\,\big(y_i - h(x_i)\big)\Big)$.   (280)
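Numerically, a multi-dimensional Gaussian of this form, with non-zero off-diagonal elements in its inverse covariance, can be evaluated directly (the matrix and residual below are made-up toy values):

```python
import numpy as np

# Toy 2-component measurement with correlated noise (illustrative numbers).
K_i = np.array([[4.0, 1.0],
                [1.0, 2.0]])   # inverse covariance, off-diagonal coupling
r = np.array([0.3, -0.2])      # residual y_i - h(x_i)

# Density with inverse covariance K_i, Eq. (280):
# p = sqrt(det(K_i / (2 pi))) * exp(-0.5 r^T K_i r)
p = np.sqrt(np.linalg.det(K_i / (2 * np.pi))) * np.exp(-0.5 * r @ K_i @ r)

# Cross-check against the covariance form of the same density.
cov = np.linalg.inv(K_i)
norm = 1.0 / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(cov))
p_check = norm * np.exp(-0.5 * r @ K_i @ r)
assert np.isclose(p, p_check)
```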

**Non-invertible $K_0$**:
For non-invertible $K_0$
one can solve the stationarity equation for $h$
using the Moore-Penrose inverse
$K_0^{\#}$.
Let us first recall some basic facts
[58,164,186,187,126,15,120].
A pseudo inverse $A^{\#}$ of a (possibly non-square) $A$
is defined by the conditions

$A^{\#}\, A\, A^{\#} = A^{\#}$,   (282)

$A\, A^{\#}\, A = A$.   (283)

(For the Moore-Penrose inverse, $A\,A^{\#}$ and $A^{\#}A$ are additionally required to be hermitian, which makes $A^{\#}$ unique.)

In that case the solution of a (solvable) linear equation $A\, h = b$ is

$h = A^{\#}\, b + h^{\rm hom}, \qquad h^{\rm hom} = \big(I - A^{\#} A\big)\, v$,

where $h^{\rm hom}$ is a solution of the homogeneous equation $A\, h^{\rm hom} = 0$ and the vector $v$ is arbitrary. Hence, $h^{\rm hom}$ can be expanded in an orthonormalized basis $\psi_l$ of the null space of $A$,

$h = A^{\#}\, b + \sum_l c_l\, \psi_l$.   (286)

For a diagonal $D$ with diagonal elements $d_k$, the Moore-Penrose inverse is

$\big(D^{\#}\big)_{kk} = \begin{cases} 1/d_k, & d_k \neq 0 \\ 0, & d_k = 0 \end{cases}$   (288)

so that $D^{\#} D$ becomes the projector on the space where $D$ is invertible. Similarly, $I - D^{\#} D$ is the projector on the null space of $D$.

For a general (possibly non-square) $A$ there always exist orthogonal $U$ and $V$ so it can be written in orthogonal normal form $A = U\, D\, V^{T}$, where $D$ is of the form of Eq. (287) [120]. The corresponding Moore-Penrose inverse is $A^{\#} = V\, D^{\#}\, U^{T}$.
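This construction can be checked numerically (the random rank-deficient toy matrix is an assumption; NumPy's SVD provides the orthogonal normal form):

```python
import numpy as np

rng = np.random.default_rng(3)
# Rank-deficient 4 x 6 toy matrix: only the first two columns are non-zero.
A = rng.normal(size=(4, 6)) @ np.diag([3.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Orthogonal normal form A = U D V^T and pseudo inverse A# = V D# U^T.
U, s, Vt = np.linalg.svd(A)
D_pinv = np.zeros((A.shape[1], A.shape[0]))
idx = np.where(s > 1e-12)[0]           # invert only the non-zero part of D
D_pinv[idx, idx] = 1.0 / s[idx]
A_pinv = Vt.T @ D_pinv @ U.T

# The defining conditions of the pseudo inverse, Eqs. (282) and (283):
assert np.allclose(A_pinv @ A @ A_pinv, A_pinv)
assert np.allclose(A @ A_pinv @ A, A)
assert np.allclose(A_pinv, np.linalg.pinv(A))
```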

Similarly, for an $A$ which can be diagonalized, i.e.,
for a square matrix $A = M\, D\, M^{-1}$
with diagonal $D$,
the Moore-Penrose inverse is
$A^{\#} = M\, D^{\#}\, M^{-1}$.
From Eq. (289) it follows then that

$A^{\#}\, A = A\, A^{\#}$   (291)

$= M\, D^{\#} D\, M^{-1}$   (292)

$= P_1$,   (293)

the projector on the space where $A$ is invertible, and

$I - A^{\#}\, A$   (294)

$= M\,\big(I - D^{\#} D\big)\, M^{-1}$   (295)

$= P_0 = \sum_l \psi_l\, \psi_l^{T}$,   (296)

the projector on the null space of $A$, expressed in an orthonormalized basis $\psi_l$ of that null space.

Now we apply this to Eq. (265),
where $K_0$ is
diagonalizable because it is positive semi-definite.
(In this case $M$ is an orthogonal matrix
and the entries of $D$ are real and larger or equal to zero.)
Hence, one obtains under the condition

$P_0\, K_D\,(t_D - h) = 0$   (297)

the solution

$h = t_0 + K_0^{\#}\, K_D\,(t_D - h) + \sum_l c_l\, \psi_l$,   (298)

where $P_0\,(h - t_0) = \sum_l c_l\, \psi_l$, so that this null space component can be expanded in an orthonormalized basis $\psi_l$ of the null space of $K_0$, assumed here to be of finite dimension. To find an equation in data space define the vector

$a = K_D\,(t_D - h)$   (299)

to get from Eqs. (297) and (298)

$P_0\, a = 0, \qquad h = t_0 + K_0^{\#}\, a + \sum_l c_l\, \psi_l$.   (300), (301)

These equations have to be solved for $a$ and the coefficients $c_l$. Inserting Eq. (301) into the definition (299) gives

$a = K_D\,\Big(t_D - t_0 - K_0^{\#}\, a - \sum_l c_l\, \psi_l\Big)$,   (302)

using $P_0\, K_0^{\#} = 0$ according to Eq. (290). Using $K_D = K_D\, P_D$ [where $P_D$, the projector on the space of training data, is defined in (260)] the solvability condition (297) becomes

$0 = P_0\, a$   (303)

$= \sum_l \psi_l\, \big\langle\, \psi_l,\, a\,\big\rangle$,   (304)

i.e., $\;0 = \sum_j \psi_l(x_j)\, a(x_j)$ for all $l$.   (305)
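The solvability condition can be verified numerically. In the sketch below (grid, toy data, and a squared second-difference prior $K_0 = D_2^T D_2$, whose null space consists of the affine functions, are all illustrative assumptions), $K = K_D + K_0$ is invertible although $K_0$ is not, and at the minimum $a = K_D(t_D - h)$ is orthogonal to the null space basis of $K_0$:

```python
import numpy as np

m = 30
x = np.linspace(0.0, 1.0, m)
rng = np.random.default_rng(4)
y = x**2 + 0.05 * rng.normal(size=m)     # one noisy measurement per grid point
sigma2 = 0.05**2

# Non-invertible prior operator: K_0 = D2^T D2 (squared second differences).
D2 = np.diff(np.eye(m), n=2, axis=0)     # (m-2) x m second-difference matrix
K0 = D2.T @ D2                           # null space spanned by 1 and x

# Stationarity equation (h - y)/sigma2 + K0 h = 0; K = I/sigma2 + K0 is
# invertible even though K_0 is not.
h = np.linalg.solve(np.eye(m) / sigma2 + K0, y / sigma2)

# Solvability condition: a = (y - h)/sigma2 = K0 h must be orthogonal to the
# null space of K_0, here spanned by psi_1 = 1 and psi_2 = x.
a = (y - h) / sigma2
print(abs(np.sum(a)), abs(np.sum(x * a)))   # both ~ 0
```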

Again, general non-quadratic data terms can be allowed.
In that case

$a = -\,\frac{\delta g_D}{\delta h}, \qquad P_D\, a = a,$

and Eq. (299) becomes the nonlinear equation

$a = -\,\frac{\delta g_D}{\delta h}\bigg|_{\,h \,=\, t_0 + K_0^{\#} a\, +\, \sum_l c_l \psi_l}$,   (306)

to be solved, together with the solvability condition (303), for $a$ and the coefficients $c_l$.