General loss functions and unsupervised learning

Choosing an action $a$ in a specific situation often requires a task-specific loss function. Such a loss function may, for example, contain additional terms measuring costs of choosing action $a$ that are unrelated to the approximation of the predictive density. These costs can quantify aspects like the simplicity, implementability, production cost, sparsity, or understandability of action $a$.

Furthermore, instead of approximating a whole density it often suffices to extract some of its features, like identifying clusters of similar $y$-values, finding independent components for multidimensional $y$, or mapping to an approximating density with lower dimensional $x$. This kind of exploratory data analysis is the Bayesian analogue of unsupervised learning methods. Such methods are often used as a preprocessing step, but they are also important for choosing actions in situations where specific loss functions can be defined.

From a Bayesian point of view, general loss functions require an explicit two-step procedure [132]: 1. calculate (an approximation of) the predictive density, and 2. minimize the expectation of the loss function under that (approximated) predictive density. (Empirical risk minimization, in contrast, minimizes the empirical average of the (possibly regularized) loss function; see Section 2.5.) For a related example see, for instance, [139].
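The following Python fragment is a minimal, illustrative sketch of this two-step procedure, assuming the (approximated) predictive density of step 1 is represented by samples; all function names are hypothetical and not part of the text above. The empirical-risk variant, which averages the loss over the training data instead, is shown for contrast.

\begin{verbatim}
import numpy as np

def expected_loss(loss, a, y_samples):
    # Step 2: Monte Carlo estimate of the expected loss under the
    # (sampled) predictive density obtained in step 1.
    return np.mean([loss(y, a) for y in y_samples])

def bayes_action(loss, actions, y_samples):
    # Choose the action minimizing the expected loss over a candidate set.
    return min(actions, key=lambda a: expected_loss(loss, a, y_samples))

def empirical_risk_action(loss, actions, y_train):
    # Empirical risk minimization: average the loss over training data instead.
    return min(actions, key=lambda a: np.mean([loss(y, a) for y in y_train]))

# Example: squared-error loss, scalar actions, Gaussian stand-in for step 1.
rng = np.random.default_rng(0)
y_samples = rng.normal(1.0, 0.5, size=1000)   # samples from p(y|x,D,D_0)
y_train   = rng.normal(1.0, 0.5, size=20)     # observed training data
sq_loss   = lambda y, a: (y - a) ** 2
actions   = np.linspace(-2.0, 3.0, 101)
print(bayes_action(sq_loss, actions, y_samples))         # near the predictive mean
print(empirical_risk_action(sq_loss, actions, y_train))  # near the sample mean
\end{verbatim}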

Consider, for example, a Bayesian version of cluster analysis, which partitions a predictive density obtained from empirical data into several clusters. A possible loss function is

\begin{displaymath}
l(x,y,a) = \left( y - a(x,y) \right)^2 ,
\end{displaymath} (64)

where the action $a(x,y)$ maps $y$, for given $x$, to one of a finite number of cluster centers (prototypes). Another example of a clustering method based on the predictive density is Fukunaga's valley-seeking procedure [61].
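A minimal sketch of minimizing the expected loss (64) is given below, assuming one-dimensional $y$, a predictive density for fixed $x$ represented by samples, and an action $a(x,y)$ that assigns each $y$ to its nearest prototype; the resulting fixed-point iteration is of $k$-means type. The function name is illustrative.

\begin{verbatim}
import numpy as np

def fit_prototypes(y_samples, n_proto=3, n_iter=50, seed=0):
    # y_samples: samples from the predictive density p(y|x,D,D_0) for fixed x.
    rng = np.random.default_rng(seed)
    protos = rng.choice(y_samples, size=n_proto, replace=False).astype(float)
    for _ in range(n_iter):
        # a(x,y): assign each sample to its nearest prototype.
        assign = np.argmin((y_samples[:, None] - protos[None, :]) ** 2, axis=1)
        # Updating each prototype to the mean of its assigned samples lowers
        # the Monte Carlo estimate of the expected loss E[(y - a(x,y))^2].
        for k in range(n_proto):
            if np.any(assign == k):
                protos[k] = y_samples[assign == k].mean()
    return protos
\end{verbatim}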

For multidimensional $x$, a space of actions $a({\bf P}_x x, y)$ can be chosen that depends only on a (possibly adaptable) lower dimensional projection ${\bf P}_x x$ of $x$.

For multidimensional $y$ with components $y_i$ it is often useful to identify independent components. One may look, say, for a linear mapping $\tilde y = {\bf M} y$ that minimizes the correlations between different components of the `source' variables $\tilde y$, obtained by minimizing the loss function

\begin{displaymath}
l(y,y^\prime,{\bf M}) = \sum_{i\ne j} \tilde y_i\, \tilde y_j^\prime ,
\end{displaymath} (65)

with respect to ${\bf M}$ under the joint predictive density for $y$ and $y^\prime$ given $x$, $x^\prime$, $D$, $D_0$. This includes a Bayesian version of blind source separation (e.g., applied to the so-called cocktail party problem [14,7]), analogous to the treatment of Molgedey and Schuster [162]. Interesting projections of multidimensional $y$ can, for example, be found by projection pursuit techniques [59,102,108,211].
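As an illustration, the following sketch estimates ${\bf M}$ in the spirit of Molgedey and Schuster from paired samples $(y,y^\prime)$, here taken to be a signal and a time-delayed copy of it: driving the expected loss (65) to zero amounts to jointly diagonalizing the equal-time and delayed correlation matrices, which leads to a generalized eigenvalue problem. Variable and function names are illustrative, not taken from the references.

\begin{verbatim}
import numpy as np
from scipy.linalg import eig

def unmixing_matrix(Y, tau=1):
    # Y: array of shape (n_samples, dim); rows are observations of y.
    Y = Y - Y.mean(axis=0)
    Y0, Yt = Y[:-tau], Y[tau:]       # pairs (y, y') via a time delay tau
    C0 = Y0.T @ Y0 / len(Y0)         # equal-time correlations E[y y^T]
    Ct = Y0.T @ Yt / len(Y0)         # delayed correlations    E[y y'^T]
    Ct = 0.5 * (Ct + Ct.T)           # symmetrize for numerical stability
    _, W = eig(Ct, C0)               # generalized eigenvectors
    M = np.real(W).T                 # rows of M give the estimated sources,
    return M                         # so that tilde(y) = M y decorrelates (65)
\end{verbatim}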

