
Introduction

The last decade has seen a rapidly growing interest in learning from observational data. Increasing computational resources have enabled successful applications of empirical learning algorithms in various areas, including, for example, time series prediction, image reconstruction, speech recognition, computed tomography, and inverse scattering and inverse spectral problems for quantum mechanical systems. Empirical learning, i.e., the problem of finding underlying general laws from observations, represents a typical inverse problem and is usually ill-posed in the sense of Hadamard [220,221,224,146,115,226]. It is well known that a successful solution of such problems requires additional a priori information. It is this a priori information which controls the generalization ability of a learning system, providing the link between the available empirical ``training'' data and unknown outcomes in future ``test'' situations.

We will focus mainly on nonparametric approaches, formulated directly in terms of the function values of interest. Parametric methods, on the other hand, typically impose implicit restrictions which are often extremely difficult to relate to available a priori knowledge. Combined with a Bayesian framework [11,16,33,145,201,174,18,69,212,35,235,42,105,104], a nonparametric approach allows a very flexible and interpretable implementation of a priori information in the form of stochastic processes. Nonparametric Bayesian methods can easily be adapted to different learning situations and have therefore been applied to a variety of empirical learning problems, including regression, classification, density estimation, and inverse quantum problems [170,237,143,142,138,222]. Technically, they are related to kernel and regularization methods, which often appear in the form of a roughness penalty approach [221,224,192,211,153,228,90,83,116,226]. Computationally, working with stochastic processes, or discretized versions thereof, is more demanding than, for example, fitting a small number of parameters. This holds especially for applications where one cannot take full advantage of the convenient analytical features of Gaussian processes. Nevertheless, it seems to be the right time to study nonparametric Bayesian approaches also for non-Gaussian problems: they are now becoming computationally feasible, at least for low dimensional systems, and, even where not directly solvable, they provide a well defined basis for further approximations.

In this paper we will in particular study general density estimation problems. These include, as special cases, regression, classification, and certain types of clustering. In density estimation the functions of interest are the probability densities $p(y\vert x,h)$ of producing output (``data'') $y$ under condition $x$ and unknown state of Nature $h$. Considered as a function of $h$, for fixed $y$ and $x$, the function $p(y\vert x,h)$ is also known as the likelihood function, and a Bayesian approach to density estimation is based on a probabilistic model for the likelihoods $p(y\vert x,h)$. We will concentrate on situations where $y$ and $x$ are real variables, possibly multi-dimensional. In a nonparametric approach, the variable $h$ represents the whole likelihood function $p(y\vert x,h)$. That means, $h$ may be seen as the collection of the non-negative numbers $p(y\vert x,h)$ for all $x$ and all $y$. The dimension of $h$ is thus infinite if the number of values which the variables $x$ and/or $y$ can take is infinite. This is the case for real $x$ and/or $y$.

A learning problem with a discrete $y$ variable is also called a classification problem. Restricting to Gaussian probabilities $p(y\vert x,h)$ with fixed variance leads to (Gaussian) regression problems. For regression problems the aim is to find an optimal regression function $h(x)$. Similarly, adapting a mixture of Gaussians allows soft clustering of data points. Furthermore, extracting relevant features from the predictive density $p(y\vert x,{\rm data})$ is the Bayesian analogue of unsupervised learning. Other special density estimation problems are, for example, inverse problems in quantum mechanics where $h$ represents an unknown potential to be determined from observational data [143,142,138,222]. Special emphasis will be put on the explicit and flexible implementation of a priori information using, for example, mixtures of Gaussian prior processes with adaptive, non-zero mean functions for the mixture components.
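For concreteness, the Gaussian regression case just mentioned corresponds to a likelihood of the form (the symbol $\sigma^2$ for the fixed variance is introduced here only for illustration)

$p(y\vert x,h) = (2\pi\sigma^2)^{-1/2}\, \exp\!\left( -\frac{(y-h(x))^2}{2\sigma^2} \right),$

so that adapting the density reduces to adapting the regression function $h(x)$. Likewise, the predictive density referred to above is, as usual in a Bayesian setting, the likelihood averaged over the posterior, $p(y\vert x,{\rm data})$ = $\int\!dh\, p(y\vert x,h)\,p(h\vert{\rm data})$.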

Let us now briefly explain what is meant by the term ``Bayesian Field Theory'': From a physicist's point of view, a function like $h(x,y)$ = $p(y\vert x,h)$, depending on continuous variables $x$ and/or $y$, is often called a `field'. In most of this paper we will, as is common in field theories in physics, not parameterize these fields but formulate the relevant probability densities or stochastic processes, like the prior $p(h)$ or the posterior $p(h\vert f)$, directly in terms of the field values $h(x,y)$, e.g., $p(h\vert f)$ = $p(h(x,y),x\in X,y\in Y\vert f)$. (In the parametric case, discussed in Chapter 4, we obtain a probability density $p(h\vert f)$ = $p(\xi\vert f)$ for fields $h(x,y,\xi)$ parameterized by $\xi$.)
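For orientation, the relation between these densities is the usual Bayes rule, written here directly in terms of the field values; assuming, as is common in such settings, that the training pairs $(x_i,y_i)$ are generated independently and that the $x_i$ carry no information about $h$, it reads

$p(h\vert f) = \frac{p(f\vert h)\,p(h)}{p(f)} \propto p(h) \prod_i p(y_i\vert x_i,h),$

where $f$ stands for the observed (training) data.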

The possibility of solving Gaussian integrals analytically makes Gaussian processes, or (generalized) free fields in the language of physicists, very attractive for nonparametric learning. Unfortunately, only the case of Gaussian regression is completely Gaussian. For general density estimation problems the likelihood terms are non-Gaussian, and even for Gaussian priors additional non-Gaussian restrictions have to be included to ensure non-negativity and normalization of densities. Hence, in the general case, density estimation corresponds to a non-Gaussian, i.e., interacting field theory.
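Explicitly, the non-Gaussian ingredients referred to here are, on the one hand, the constraints

$p(y\vert x,h) \ge 0, \qquad \int\! dy\; p(y\vert x,h) = 1 \quad \mbox{for all } x,$

and, on the other hand, the likelihood terms themselves, whose negative logarithm $-\sum_i \ln p(y_i\vert x_i,h)$ is in general not a quadratic functional of $h$. In the language of field theory such non-quadratic terms play the role of interaction terms.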

As is well known from physics, a continuum limit for non-Gaussian theories, based on the definition of a renormalization procedure, can be highly nontrivial to construct. (See [20,5] for a renormalization approach to density estimation.) We will in the following not discuss such renormalization procedures but focus on practical, numerical learning algorithms, obtained by discretizing the problem (typically, but not necessarily, in coordinate space). This is similar to what is done in lattice field theories.

Gaussian problems live effectively in a space with dimension not larger than the number of training data. This is not the case for non-Gaussian problems. Hence, numerical implementations of learning algorithms for non-Gaussian problems require a discretization of the functions of interest. This can be computationally challenging.
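As a minimal sketch of what such a discretization can look like (grid sizes, the exponential parameterization, and all names below are illustrative assumptions, not the algorithms discussed in Chapter 7), the field $h(x,y)$ = $p(y\vert x,h)$ may be represented by its values on a finite $x$-$y$ grid, with non-negativity and normalization enforced for every $x$ bin:

  import numpy as np

  # Represent the field h(x,y) = p(y|x,h) by its values on a finite grid.
  n_x, n_y = 50, 100                # number of x bins and y bins (illustrative)
  y = np.linspace(-3.0, 3.0, n_y)   # discretized y axis
  dy = y[1] - y[0]                  # bin width used in the normalization sum

  # Unconstrained real-valued "pre-field"; a stand-in for whatever quantity
  # a learning algorithm would actually adapt.
  a = np.random.randn(n_x, n_y)

  p = np.exp(a)                              # enforce non-negativity
  p /= p.sum(axis=1, keepdims=True) * dy     # enforce sum_j p[i,j]*dy = 1 per x bin

  # p[i, j] now approximates p(y_j | x_i, h); its n_x * n_y entries illustrate
  # why the problem does not reduce to the number of training points.
  assert np.allclose((p * dy).sum(axis=1), 1.0)

The number of grid points grows rapidly with the dimension of $(x,y)$, which is the computational challenge mentioned above.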

For low dimensional problems, however, many non-Gaussian models are nowadays solvable on a standard PC. Examples include predictions of one-dimensional time series or the reconstruction of two-dimensional images. Higher dimensional problems require additional approximations, like projections into lower dimensional subspaces or other variational approaches. Indeed, it seems that most solvable high dimensional problems live effectively in some low dimensional subspace.

There are special situations in classification where non-negativity and normalization constraints are fulfilled automatically. In that case, the calculations can still be performed in a space of dimension not larger than the number of training data. In contrast to Gaussian models, however, the equations to be solved are then typically nonlinear.

Summarizing, we will call a nonparametric Bayesian model for learning a function of one or more continuous variables a Bayesian field theory, having especially non-Gaussian models in mind. A large variety of Bayesian field theories can be constructed by combining specific likelihood models with specific functional priors (see Table 1). The resulting flexibility of nonparametric Bayesian approaches is probably their main advantage.


Table 1: A Bayesian approach is based on the combination of two models, a likelihood model, describing the measurement process used to obtain the training data, and a prior model, enabling generalization to non-training data. Parameters of the prior model are commonly called hyperparameters. In ``nonparametric'' approaches the collection of all values of the likelihood function itself is treated as the parameters. A nonparametric Bayesian approach for likelihoods depending on one or more real variables is in this paper called a Bayesian field theory.
likelihood model                              prior model
describes:
    measurement process (Chap. 2)             generalization behavior (Chap. 2)
is determined by:
    parameters (Chaps. 3, 4)                  hyperparameters (Chap. 5)
Examples include:
    density estimation (Sects. 3.1-3.6, 6.2)  hard constraints (Chap. 2)
    regression (Sects. 3.7, 6.3)              Gaussian prior factors (Chap. 3)
    classification (Sect. 3.8)                mixtures of Gaussians (Sects. 6.1-6.4)
    inverse quantum theory (Sect. 3.9)        non-quadratic potentials (Sect. 6.5)
Learning algorithms are treated in Chapter 7.


The paper is organized as follows: Chapter 2 summarizes the Bayesian framework as needed for the subsequent chapters. Basic notations are defined, an introduction to Bayesian decision theory is given, and the role of a priori information is discussed, together with the basics of the Maximum A Posteriori Approximation (MAP) and the specific constraints for density estimation problems. Gaussian prior processes, being the most commonly used prior processes in nonparametric statistics, are treated in Chapter 3. In combination with Gaussian prior models, this chapter also introduces the likelihood models of density estimation (Sections 3.1, 3.2, 3.3), Gaussian regression and clustering (Section 3.7), classification (Section 3.8), and inverse quantum problems (Section 3.9). Notice, however, that all these likelihood models can also be combined with the more elaborate prior models discussed in the subsequent chapters of the paper. Parametric approaches, useful if a numerical solution of a full nonparametric approach is not feasible, are the topic of Chapter 4. Hyperparameters, parameterizing prior processes and making them more flexible, are considered in Chapter 5. Two possibilities to go beyond Gaussian processes, mixture models and non-quadratic potentials, are presented in Chapter 6. Chapter 7 focuses on learning algorithms, i.e., on methods to solve the stationarity equations resulting from a Maximum A Posteriori Approximation. In this chapter one can also find numerical solutions of Bayesian field theoretical models for general density estimation.

