
Basic notations

A Bayesian approach is based upon two main ingredients:

1.
A model of Nature, i.e., a space ${\cal H}$ of hypotheses $h$ defined by their likelihood functions $p(x\vert c,h)$. A likelihood function specifies the probability density for producing outcome $x$ (measured value or dependent visible variable, assumed to be directly observable) under hypothesis $h$ (possible state of Nature or hidden variable, assumed not to be directly observable) and condition $c$ (measurement device parameters or independent visible variable, assumed to be directly observable).

2.
A prior density $p_0(h)$ = $p(h\vert D_0)$ defined over the space ${\cal H}$ of hypotheses, $D_0$ denoting collectively all available a priori information. (A minimal coded sketch of both ingredients follows this list.)

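Purely as an illustration of these two ingredients, they might be coded as follows; the linear hypothesis $h(c)$ = $a\,c$ with Gaussian noise and the Gaussian prior on the slope $a$ are hypothetical choices made for this sketch only, not part of the general formalism.

\begin{verbatim}
import numpy as np

# Ingredient 1: likelihood p(x|c,h).  Hypothetical model: hypotheses h
# are labelled by a slope a, and x is Gaussian around h(c) = a*c.
def likelihood(x, c, a, sigma=1.0):
    return np.exp(-(x - a * c) ** 2 / (2 * sigma ** 2)) \
           / np.sqrt(2 * np.pi * sigma ** 2)

# Ingredient 2: prior density p_0(h) over the hypothesis space, here a
# Gaussian on the slope a, encoding the a priori information D_0.
def prior(a, a0=0.0, tau=1.0):
    return np.exp(-(a - a0) ** 2 / (2 * tau ** 2)) \
           / np.sqrt(2 * np.pi * tau ** 2)
\end{verbatim}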
Now assume (new) training data $D_T$ = $(x_T,c_T)$ = $\{(x_i,c_i)\vert 1\le i\le n\}$ become available, consisting of pairs of measured values $x_i$ under known conditions $c_i$ (and unknown $h$). Then Bayes' theorem

\begin{displaymath}
p(h\vert D)
=\frac{p(x_T\vert c_T,h)\,p_0(h)}{p_0(x_T\vert c_T)}
,
\end{displaymath} (1)

is used to update the prior density $p_0(h)$ = $p(h\vert D_0)$ to the (new) posterior density $p(h\vert D)$ = $p(h\vert D_T,D_0)$. Here we write $D$ = $(D_T,D_0)$ to denote both training data and a priori information. For i.i.d. training data $D_T$ the likelihood factorizes, $p(x_T\vert c_T,h)$ = $\prod_{i=1}^n p(x_i\vert c_i,h)$. Note that the denominator appearing in Eq. (1), $p_0(x_T\vert c_T)$ = $\int\! dh\, p(x_T\vert c_T,h)\,p_0(h)$, is $h$-independent. It plays the role of a normalization factor, also known as the evidence. Thus, the terms in Eq. (1) are named as follows,
\begin{displaymath}
{\rm posterior}
= \frac{{\rm likelihood}\times{\rm prior}}{\rm evidence}
.
\end{displaymath} (2)
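Numerically, the update (1)-(2) can be sketched by discretizing the hypothesis space. The fragment below stays within the hypothetical slope model introduced above (all names and data values are illustrative) and evaluates posterior = likelihood $\times$ prior / evidence on a grid of slope values.

\begin{verbatim}
import numpy as np

def likelihood(x, c, a, sigma=1.0):
    # p(x|c,h) for the hypothetical model h(c) = a*c with Gaussian noise
    return np.exp(-(x - a * c) ** 2 / (2 * sigma ** 2)) \
           / np.sqrt(2 * np.pi * sigma ** 2)

a_grid = np.linspace(-3.0, 3.0, 601)           # discretized hypothesis space
da     = a_grid[1] - a_grid[0]
prior  = np.exp(-a_grid ** 2 / 2) / np.sqrt(2 * np.pi)    # p_0(h)

# i.i.d. training data D_T = {(x_i, c_i)}; the likelihood factorizes
c_T = np.array([0.5, 1.0, 1.5, 2.0])
x_T = np.array([0.7, 1.1, 1.8, 2.1])
like_T = np.prod([likelihood(x, c, a_grid) for x, c in zip(x_T, c_T)], axis=0)

evidence  = np.sum(like_T * prior) * da        # p_0(x_T|c_T), h-independent
posterior = like_T * prior / evidence          # Eq. (1)
\end{verbatim}

For the functional integrals considered below, the grid over a single parameter $a$ is of course replaced by a much higher dimensional discretization.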

To make predictions, a Bayesian approach aims at calculating the predictive density

\begin{displaymath}
p(x\vert c,D)
= \int \!dh\, p(x\vert c,h)\, p(h\vert D)
,
\end{displaymath} (3)

which is the average of the likelihoods weighted by their posterior probabilities. The $h$-integral can be extremely high dimensional and often, as in the case we focus on here, is even a functional integral [39,40] over a space of likelihood functions ${\cal H}$. Insofar as an analytical integration is not possible, the integral has to be treated, for example, by Monte Carlo methods [30, 41-44] or in saddle point approximation [26,30,45,46]. Assuming the likelihood term $p(x\vert c,h)$ to be slowly varying at the stationary point, the latter is also known as the maximum posterior approximation. In this approximation the $h$-integration is effectively replaced by a maximization of the posterior, meaning that the predictive density is approximated by
\begin{displaymath}
p(x\vert c,D) \approx p(x\vert c,h^*),
\end{displaymath} (4)

where
\begin{displaymath}
h^*
= {\rm argmax}_{h\in{\cal H}} p(h\vert D)
= {\rm argmin}_{h\in{\cal H}} [-\ln p(h\vert D)]
.
\end{displaymath} (5)
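In the same discretized sketch (hypothetical slope model, illustrative data), the predictive density (3) is the posterior-weighted average of the likelihoods, and the maximum posterior approximation (4)-(5) replaces the $h$-integral by the single grid point minimizing $-\ln p(h\vert D)$.

\begin{verbatim}
import numpy as np

def likelihood(x, c, a, sigma=1.0):
    # p(x|c,h) for the hypothetical model h(c) = a*c with Gaussian noise
    return np.exp(-(x - a * c) ** 2 / (2 * sigma ** 2)) \
           / np.sqrt(2 * np.pi * sigma ** 2)

# Discretized hypothesis space and posterior, as in the previous sketch
a_grid = np.linspace(-3.0, 3.0, 601)
da     = a_grid[1] - a_grid[0]
prior  = np.exp(-a_grid ** 2 / 2) / np.sqrt(2 * np.pi)
c_T    = np.array([0.5, 1.0, 1.5, 2.0])
x_T    = np.array([0.7, 1.1, 1.8, 2.1])
like_T = np.prod([likelihood(x, c, a_grid) for x, c in zip(x_T, c_T)], axis=0)
posterior = like_T * prior / (np.sum(like_T * prior) * da)

# Eq. (3): predictive density p(x|c,D) as a posterior-weighted average
x_new, c_new = 1.0, 1.0
p_pred = np.sum(likelihood(x_new, c_new, a_grid) * posterior) * da

# Eqs. (4)-(5): maximum posterior approximation
a_star = a_grid[np.argmin(-np.log(posterior))]
p_map  = likelihood(x_new, c_new, a_star)      # approximates p(x|c,D)
\end{verbatim}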

The term $-\ln p(h\vert D)$ is also often referred to as a (regularized) error functional, and indeed a maximum posterior approximation is technically equivalent to minimizing an error functional with Tikhonov regularization [2-4, 47]. The difference between the Bayesian approach and classical Tikhonov regularization lies in the interpretation of the extra term: as a priori information in the Bayesian case, as costs in the case of Tikhonov regularization.
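As a standard illustration (the Gaussian regression likelihood and Gaussian smoothness prior chosen here are a hypothetical, though common, example), the negative log posterior takes, up to $h$-independent constants, the familiar Tikhonov form
\begin{displaymath}
-\ln p(h\vert D)
= \frac{1}{2\sigma^2}\sum_{i=1}^n \bigl(x_i - h(c_i)\bigr)^2
+ \frac{\lambda}{2}\int\! dc\, \left(\frac{\partial h(c)}{\partial c}\right)^2
+ {\rm const}
,
\end{displaymath}
where the first term is the negative log training likelihood and the second is the regularizer contributed by the prior.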

Within a maximum likelihood approach an optimal hypothesis $h$ is obtained by maximizing only its training likelihood $p(x_T\vert c_T,h)$ instead of its complete posterior. This is equivalent to a maximum posterior approximation with a uniform prior density. A maximum likelihood approach can be used for hypotheses $h$ = $h(\xi)$, parameterized by (vectors) $\xi$, provided the parameterization is restrictive enough, and well enough adapted to the problem, that no additional prior is required to generalize from training to non-training data. For completely flexible nonparametric approaches, however, the prior term is indispensable: it provides the information linking training data and (future) non-training data. Indeed, if every number $p(x\vert c,h)$ is considered as a single degree of freedom [restricted only by the positivity constraint $p(x\vert c,h) > 0$ and the normalization over $x$], then, without a priori information, training data contain no information about non-training data.
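To illustrate the relation between the two approaches in the hypothetical slope model used above (illustrative names and data), maximum likelihood maximizes the training likelihood alone, whereas maximum posterior maximizes likelihood times prior; with a uniform prior the two coincide.

\begin{verbatim}
import numpy as np

def likelihood(x, c, a, sigma=1.0):
    # p(x|c,h) for the hypothetical model h(c) = a*c with Gaussian noise
    return np.exp(-(x - a * c) ** 2 / (2 * sigma ** 2)) \
           / np.sqrt(2 * np.pi * sigma ** 2)

a_grid = np.linspace(-3.0, 3.0, 601)
prior  = np.exp(-a_grid ** 2 / 2) / np.sqrt(2 * np.pi)    # p_0(h)
c_T    = np.array([0.5, 1.0, 1.5, 2.0])
x_T    = np.array([0.7, 1.1, 1.8, 2.1])
like_T = np.prod([likelihood(x, c, a_grid) for x, c in zip(x_T, c_T)], axis=0)

a_ml  = a_grid[np.argmax(like_T)]            # maximum likelihood
a_map = a_grid[np.argmax(like_T * prior)]    # maximum posterior
# With a uniform (constant) prior on the grid, a_map equals a_ml.
\end{verbatim}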

