In Chapter 4 parameterizations of φ have been studied. This section now discusses parameterizations of the prior density of φ. For Gaussian prior densities that means a parameterization of mean and/or covariance.
The parameters of the prior functional, which we will denote by θ, are in a Bayesian context also known as hyperparameters. Hyperparameters θ can be considered as part of the hidden variables.
In a full Bayesian approach the φ-integral therefore has to be completed by an integral over the additional hidden variables θ. Analogously, the prior densities for φ can be supplemented by priors p(θ) for θ, also called hyperpriors, with corresponding energies E_θ.
In saddle point approximation an additional stationarity equation will thus appear, resulting from the derivative with respect to θ. The saddle point approximation of the θ-integration (in the case of a uniform hyperprior and with the φ-integral being calculated exactly or by approximation) is also known as ML-II prior [16] or evidence framework [85,86,213,147,148,149,24].
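To illustrate the evidence framework in its simplest setting, the following sketch (not from the original text) uses a toy Bayesian linear regression model in which the φ-integral over the weights is Gaussian and can be done analytically; the prior precision alpha plays the role of the hyperparameter θ and is selected by maximizing the resulting log evidence. The model, the variable names, and the numerical values are assumptions made for this example.
\begin{verbatim}
import numpy as np

# Toy linear model y = A w + noise. The "phi-integral" over the weights w
# is Gaussian and analytic, leaving the evidence p(y|alpha) as a function
# of the prior precision alpha (the hyperparameter theta of the text).
rng = np.random.default_rng(1)
n, d = 40, 6
A = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
beta = 25.0                                   # known noise precision (assumed)
y = A @ w_true + rng.standard_normal(n) / np.sqrt(beta)

def log_evidence(alpha):
    # Log marginal likelihood ln p(y|alpha,beta) for the prior w ~ N(0, I/alpha).
    S_inv = alpha * np.eye(d) + beta * A.T @ A        # posterior precision
    m = beta * np.linalg.solve(S_inv, A.T @ y)        # posterior mean
    E_m = 0.5 * beta * np.sum((y - A @ m) ** 2) + 0.5 * alpha * m @ m
    sign, logdet = np.linalg.slogdet(S_inv)
    return (0.5 * d * np.log(alpha) + 0.5 * n * np.log(beta)
            - E_m - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi))

# ML-II / evidence framework: maximize the evidence over the hyperparameter.
alphas = np.logspace(-3, 3, 200)
alpha_star = alphas[int(np.argmax([log_evidence(a) for a in alphas]))]
print("evidence-optimal alpha =", alpha_star)
\end{verbatim}
In practice the φ-integral is rarely analytic and has itself to be approximated, for example by a saddle point expansion, before the maximization over θ can be carried out.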
There are some cases where it is convenient to let the likelihood depend, besides on a function φ, on a few additional parameters. In regression such a parameter can be the variance of the likelihood. Another example is the inverse temperature β introduced in Section 6.3 which, like hyperparameters, also appears in the prior.
Such parameters may formally be added to the ``direct'' hidden variables φ, yielding an enlarged φ. As those ``additional likelihood parameters'' are, like other hyperparameters, typically just real numbers and not functions like φ, they can often be treated analogously to hyperparameters. For example, they may also be determined by cross-validation (see below) or by a low dimensional integration. In contrast to pure prior parameters, however, the derivatives with respect to such ``additional likelihood parameters'' contain terms arising from the derivative of the likelihood.
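As a standard regression illustration (spelled out here, not in the original text): for a Gaussian likelihood with variance σ² as additional likelihood parameter, the negative log-likelihood of n training points is
$$
-\sum_{i=1}^{n}\ln p(y_i|x_i,\phi,\sigma^2)
\;=\;
\sum_{i=1}^{n}\frac{\big(y_i-\phi(x_i)\big)^2}{2\sigma^2}
\;+\;\frac{n}{2}\ln\big(2\pi\sigma^2\big),
$$
and its derivative with respect to σ² yields the stationarity condition σ² = (1/n) Σ_i (y_i − φ(x_i))², a contribution which originates entirely from the likelihood and has no counterpart for pure prior parameters.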
Within the Frequentist interpretation of error minimization as empirical risk minimization, hyperparameters θ can be determined by minimizing the empirical generalization error on a new set of test or validation data which is independent of the training data. Here the empirical generalization error is meant to be the pure data term of the error functional, evaluated on the new data for that φ which is optimal for the full regularized functional at the given θ and for the given training data. Elaborated techniques include cross-validation and bootstrap methods, which have been mentioned in Sections 2.5 and 4.9.
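The following sketch (not part of the original text) shows this selection procedure for a single regularization hyperparameter θ, using ridge-regularized least squares as a stand-in for a Gaussian prior whose precision is parameterized by θ; the toy data, the polynomial model, and all names are assumptions made for the illustration.
\begin{verbatim}
import numpy as np

# Toy 1D regression data, split into training and independent validation sets.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 30)
x_valid = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(30)
y_valid = np.sin(2 * np.pi * x_valid) + 0.3 * rng.standard_normal(20)

def design(x, degree=8):
    # Polynomial feature matrix; stands in for a generic parameterized phi.
    return np.vander(x, degree + 1, increasing=True)

def fit_phi(theta, x, y):
    # Minimize the regularized functional |y - A w|^2 + theta |w|^2
    # for fixed hyperparameter theta (Gaussian prior with precision theta).
    A = design(x)
    return np.linalg.solve(A.T @ A + theta * np.eye(A.shape[1]), A.T @ y)

def data_term(w, x, y):
    # Pure data term (squared error), without the prior term.
    return float(np.sum((y - design(x) @ w) ** 2))

# Determine theta by minimizing the empirical generalization error,
# i.e. the pure data term evaluated on the independent validation data.
thetas = np.logspace(-6, 2, 50)
errors = [data_term(fit_phi(t, x_train, y_train), x_valid, y_valid)
          for t in thetas]
print("selected theta =", thetas[int(np.argmin(errors))])
\end{verbatim}
Cross-validation proceeds in the same spirit, but averages the validation error over several train/validation splits of the same data set.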
Within the Bayesian interpretation of error minimization as posterior maximization, the introduction of hyperparameters leads to a new difficulty. The problem arises from the fact that it is usually desirable to interpret the error term E(φ;θ) as prior energy for φ, meaning that
$$
p(\phi|\theta) \;=\; \frac{e^{-E(\phi;\theta)}}{Z_\phi(\theta)},
\qquad\qquad (420)
$$
$$
Z_\phi(\theta) \;=\; \int\! d\phi \; e^{-E(\phi;\theta)},
\qquad\qquad (421)
$$
so that
$$
-\ln p(\phi|\theta) \;=\; E(\phi;\theta) \,+\, \ln Z_\phi(\theta).
\qquad\qquad (422)
$$
The normalization factor Z_φ(θ) is in general θ-dependent, so the term ln Z_φ(θ) has to be kept in the error functional when taking the derivative with respect to θ.
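A one-dimensional illustration (not in the original text) makes this θ-dependence explicit: for the prior energy E(φ;θ) = θφ²/2 one finds
$$
Z_\phi(\theta) \;=\; \int\! d\phi\; e^{-\theta\phi^2/2} \;=\; \sqrt{2\pi/\theta},
\qquad
-\ln p(\phi|\theta) \;=\; \tfrac{1}{2}\theta\phi^2 + \tfrac{1}{2}\ln(2\pi) - \tfrac{1}{2}\ln\theta .
$$
Setting the derivative with respect to θ to zero gives θ = 1/φ², i.e. the precision that maximizes p(φ|θ), whereas dropping the θ-dependent part of ln Z_φ(θ) would instead drive θ to zero.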
It is interesting to look at what happens if the conditional density p(φ|θ) of Eq. (419) is expressed in terms of a joint energy E(φ,θ) as follows
$$
p(\phi,\theta) \;=\; \frac{e^{-E(\phi,\theta)}}{Z_{\phi,\theta}},
\qquad
Z_{\phi,\theta} \;=\; \int\! d\phi\, d\theta\; e^{-E(\phi,\theta)},
\qquad\qquad (425)
$$
$$
p(\phi|\theta) \;=\; \frac{p(\phi,\theta)}{p(\theta)}
\;=\; \frac{e^{-E(\phi,\theta)}}{\int\! d\phi\; e^{-E(\phi,\theta)}},
\qquad\qquad (426)
$$
so that a θ-dependent φ-integral reappears in the denominator of (426).
Notice especially that this discussion also applies to the case where the hyperprior p(θ) is assumed to be uniform, so that it does not have to appear explicitly in the error functional. The two ways of expressing p(φ|θ), by a joint or a conditional energy respectively, are equivalent if the joint density factorizes. In that case, however, φ and θ are independent, so θ cannot be used to parameterize the density of φ.
Numerically the need to calculate Z_φ(θ) can be disastrous, because such normalization factors often represent an extremely high dimensional (functional) integral over φ and are, in contrast to the normalization of the likelihood over y, very difficult to calculate.
There are, however,
situations for which
remains
-independent.
Let
stand for example
for a Gaussian specific prior
(with the normalization condition factored out
as in Eq. (90)).
Then, because the normalization of a Gaussian is independent
of its mean, parameterizing the mean
=
results in
a
-independent
.
Besides their mean, Gaussian processes are characterized by their covariance operators K. Because the normalization only depends on det K, a second possibility yielding a θ-independent Z_φ is given by parameterized transformations of the form K → Oᵀ(θ) K O(θ) with orthogonal O, i.e. Oᵀ = O⁻¹. Indeed, such transformations do not change the determinant det K. They are only non-trivial for multi-dimensional Gaussians.
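Both statements can be checked explicitly for a finite dimensional φ ∈ R^d (an illustration under this assumption, not part of the original text). Writing the Gaussian specific prior energy as E(φ;θ) = ½ (φ − t(θ))ᵀ K (φ − t(θ)), with the operator K assumed symmetric and positive definite,
$$
Z_\phi \;=\; \int\! d^d\phi\; e^{-\frac{1}{2}(\phi-t(\theta))^{\rm T} K\, (\phi-t(\theta))}
\;=\; (2\pi)^{d/2}\,(\det K)^{-1/2},
$$
which is independent of the mean t(θ); and for a parameterized transformation K(θ) = Oᵀ(θ) K₀ O(θ) with Oᵀ(θ) O(θ) = I,
$$
\det K(\theta) \;=\; \det O^{\rm T}(\theta)\,\det K_0\,\det O(\theta) \;=\; \det K_0 ,
$$
so Z_φ is again θ-independent.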
For general parameterizations of density estimation problems, however, the normalization term ln Z_φ(θ) must be included.
The only way to get rid of that normalization term
would be to assume a compensating hyperprior
$$
p(\theta) \;\propto\; Z_\phi(\theta),
\qquad\text{i.e.}\qquad
E_\theta \;=\; -\ln Z_\phi(\theta) + {\rm const}.
\qquad\qquad (427)
$$
Thus, in the general case we have to consider the functional
$$
E_{\phi,\theta} \;=\; -\ln p(y_D|x_D,\phi) \;+\; E(\phi;\theta) \;+\; \ln Z_\phi(\theta) \;+\; E_\theta ,
\qquad\qquad (431)
$$
where the first term denotes the data term for the training data.
Finally, we want to remark that in case a function evaluation of the functional E_{φ,θ} is much cheaper than calculating the gradient (430), minimization methods which do not use the gradient should be considered, like, for example, the downhill simplex method [196].
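As a concrete instance of such a gradient-free minimization (a sketch, not part of the original text), the downhill simplex method is available as the Nelder-Mead option of scipy.optimize.minimize; the one-parameter objective below is a hypothetical stand-in for a validation error or for E_{φ,θ} as a function of θ, since in practice each evaluation would involve refitting φ.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def objective(log_theta):
    # Hypothetical, cheap-to-evaluate stand-in for the error as a function
    # of a single hyperparameter theta (parameterized here by its logarithm).
    theta = np.exp(log_theta[0])
    return (np.log(theta) - 1.0) ** 2 + 0.1 * theta

# Downhill simplex (Nelder-Mead) uses only function values, no gradients.
result = minimize(objective, x0=[0.0], method="Nelder-Mead")
print("theta at the minimum:", np.exp(result.x[0]))
\end{verbatim}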