In classification
(or pattern recognition) tasks
the independent visible variable
takes discrete values (group, cluster or pattern labels)
[16,61,24,47].
We write
= and
= ,
i.e.,
= .
Having received
classification data =
the density estimation error functional
for a prior on function
(with components
and = )
reads
(335) |
(336) |
For zero-one loss =
-- a typical loss function for classification problems --
the optimal decision (or Bayes classifier) is given by the mode
of the predictive density
(see Section 2.2.2), i.e.,
(337) |
For the choice non-negativity and normalization must be ensured. For with non-negativity is automatically fulfilled but the Lagrange multiplier must be included to ensure normalization.
Normalization is guaranteed by using unnormalized probabilities , (for which non-negativity has to be checked) or shifted log-likelihoods with , i.e., = . In that case the nonlocal normalization terms are part of the likelihood and no Lagrange multiplier has to be used [236]. The resulting equation can be solved in the space defined by the -data (see Eq. (153)). The restriction of = to linear functions yields log-linear models [154]. Recently a mean field theory for Gaussian Process classification has been developed [177,179].
Table 3 lists some special cases of density estimation. The last line of the table, referring to inverse quantum mechanics, will be discussed in the next section.