General loss functions and unsupervised learning

Choosing an action $a$ in a specific situation often requires a task-specific loss function. Such a loss function may, for example, contain additional terms measuring costs of choosing action $a$ that are unrelated to the approximation of the predictive density. These costs can quantify aspects like the simplicity, implementability, production cost, sparsity, or understandability of action $a$.

Furthermore, instead of approximating a whole density it often suffices to extract some of its features, like identifying clusters of similar $y$-values, finding independent components for multidimensional $y$, or mapping to an approximating density with lower dimensional $x$. This kind of exploratory data analysis is the Bayesian analogue of unsupervised learning methods. Such methods are often used as a preprocessing step, but they are also important for choosing actions in situations where specific loss functions can be defined.

From a Bayesian point of view, general loss functions require an explicit two-step procedure [132]: 1. calculate (an approximation of) the predictive density, and 2. minimize the expectation of the loss function under that (approximated) predictive density. (Empirical risk minimization, in contrast, minimizes the empirical average of the (possibly regularized) loss function; see Section 2.5.) For a related example see, for instance, [139].
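The following Python fragment is a minimal, illustrative sketch of this two-step procedure, assuming the (approximated) predictive density of step 1 is represented by samples; all function names are hypothetical and not part of the text above. The empirical-risk variant, which averages the loss over the training data instead, is shown for contrast.

\begin{verbatim}
import numpy as np

def expected_loss(loss, a, y_samples):
    # Step 2: Monte Carlo estimate of the expected loss under the
    # (sampled) predictive density obtained in step 1.
    return np.mean([loss(y, a) for y in y_samples])

def bayes_action(loss, actions, y_samples):
    # Choose the action minimizing the expected loss over a candidate set.
    return min(actions, key=lambda a: expected_loss(loss, a, y_samples))

def empirical_risk_action(loss, actions, y_train):
    # Empirical risk minimization: average the loss over training data instead.
    return min(actions, key=lambda a: np.mean([loss(y, a) for y in y_train]))

# Example: squared-error loss, scalar actions, Gaussian stand-in for step 1.
rng = np.random.default_rng(0)
y_samples = rng.normal(1.0, 0.5, size=1000)   # samples from p(y|x,D,D_0)
y_train   = rng.normal(1.0, 0.5, size=20)     # observed training data
sq_loss   = lambda y, a: (y - a) ** 2
actions   = np.linspace(-2.0, 3.0, 101)
print(bayes_action(sq_loss, actions, y_samples))         # near the predictive mean
print(empirical_risk_action(sq_loss, actions, y_train))  # near the sample mean
\end{verbatim}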

Consider, for example, a Bayesian version of cluster analysis, which partitions a predictive density obtained from empirical data into several clusters. A possible loss function is

\begin{displaymath}
l(x,y,a) = \left( y - a(x,y) \right)^2 ,
\end{displaymath} (64)

where the action $a(x,y)$ maps $y$, for given $x$, to one of a finite number of cluster centers (prototypes). Another example of a clustering method based on the predictive density is Fukunaga's valley-seeking procedure [61].
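A minimal sketch of minimizing the expected loss (64) is given below, assuming one-dimensional $y$, a predictive density for fixed $x$ represented by samples, and an action $a(x,y)$ that assigns each $y$ to its nearest prototype; the resulting fixed-point iteration is of $k$-means type. The function name is illustrative.

\begin{verbatim}
import numpy as np

def fit_prototypes(y_samples, n_proto=3, n_iter=50, seed=0):
    # y_samples: samples from the predictive density p(y|x,D,D_0) for fixed x.
    rng = np.random.default_rng(seed)
    protos = rng.choice(y_samples, size=n_proto, replace=False).astype(float)
    for _ in range(n_iter):
        # a(x,y): assign each sample to its nearest prototype.
        assign = np.argmin((y_samples[:, None] - protos[None, :]) ** 2, axis=1)
        # Updating each prototype to the mean of its assigned samples lowers
        # the Monte Carlo estimate of the expected loss E[(y - a(x,y))^2].
        for k in range(n_proto):
            if np.any(assign == k):
                protos[k] = y_samples[assign == k].mean()
    return protos
\end{verbatim}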

For multidimensional $x$, a space of actions $a({\bf P}_x x, y)$ can be chosen that depends only on a (possibly adaptable) lower dimensional projection ${\bf P}_x x$ of $x$.

For multidimensional $y$ with components $y_i$ it is often useful to identify independent components. One may look, say, for a linear mapping $\tilde y = {\bf M} y$ that minimizes the correlations between different components of the `source' variables $\tilde y$, obtained by minimizing the loss function

\begin{displaymath}
l(y,y^\prime,{\bf M}) = \sum_{i\ne j} \tilde y_i\, \tilde y_j^\prime ,
\end{displaymath} (65)

with respect to ${\bf M}$ under the joint predictive density for $y$ and $y^\prime$ given $x$, $x^\prime$, $D$, $D_0$. This includes a Bayesian version of blind source separation (e.g., applied to the so-called cocktail party problem [14,7]), analogous to the treatment of Molgedey and Schuster [162]. Interesting projections of multidimensional $y$ can, for example, be found by projection pursuit techniques [59,102,108,211].
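As an illustration, the following sketch estimates ${\bf M}$ in the spirit of Molgedey and Schuster from paired samples $(y,y^\prime)$, here taken to be a signal and a time-delayed copy of it: driving the expected loss (65) to zero amounts to jointly diagonalizing the equal-time and delayed correlation matrices, which leads to a generalized eigenvalue problem. Variable and function names are illustrative, not taken from the references.

\begin{verbatim}
import numpy as np
from scipy.linalg import eig

def unmixing_matrix(Y, tau=1):
    # Y: array of shape (n_samples, dim); rows are observations of y.
    Y = Y - Y.mean(axis=0)
    Y0, Yt = Y[:-tau], Y[tau:]       # pairs (y, y') via a time delay tau
    C0 = Y0.T @ Y0 / len(Y0)         # equal-time correlations E[y y^T]
    Ct = Y0.T @ Yt / len(Y0)         # delayed correlations    E[y y'^T]
    Ct = 0.5 * (Ct + Ct.T)           # symmetrize for numerical stability
    _, W = eig(Ct, C0)               # generalized eigenvectors
    M = np.real(W).T                 # rows of M give the estimated sources,
    return M                         # so that tilde(y) = M y decorrelates (65)
\end{verbatim}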

