

The Bayesian model

Let us consider the following random variables:

1.
$x$, representing (a vector of) independent, visible variables (`measurement situations'),
2.
$y$, being (a vector of) dependent, visible variables (`measurement results'), and
3.
$h$, being the hidden variables (`possible states of Nature').
A Bayesian approach is based on two model inputs [1,11,4,12]:
1.
A likelihood model $p(y\vert x,h)$, describing the density of observing $y$ given $x$ and $h$. Regarded as a function of $h$, for fixed $y$ and $x$, the density $p(y\vert x,h)$ is also known as the ($x$-conditional) likelihood of $h$.
2.
A prior model $p(h\vert D_0)$, specifying the a priori density of $h$ given some a priori information denoted by $D_0$ (but before training data $D_T$ have been taken into account).
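
As a toy illustration of these two model inputs (a minimal sketch with hypothetical choices, not the model studied in this paper), one may take a scalar hidden variable $h$, a Gaussian likelihood centred at $h\,x$, and a Gaussian prior for $h$:

\begin{verbatim}
from scipy.stats import norm

# Hypothetical toy model: scalar hidden variable h, y = h*x + noise.

def likelihood(y, x, h, sigma=0.5):
    """p(y|x,h): Gaussian observation model centred at h*x."""
    return norm.pdf(y, loc=h * x, scale=sigma)

def prior(h, mu0=0.0, sigma0=1.0):
    """p(h|D_0): Gaussian a priori density for h."""
    return norm.pdf(h, loc=mu0, scale=sigma0)
\end{verbatim}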

Furthermore, to decompose a possibly complicated prior density into simpler components, we introduce continuous hyperparameters $\theta$ and discrete hyperparameters $j$ (extending the set of hidden variables to $\tilde h = (h,\theta,j)$),

\begin{displaymath}
p(h\vert D_0) = \int \!d\theta \sum_j p(h,\theta,j\vert D_0) .
\qquad (1)
\end{displaymath}
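
For illustration, the decomposition (1) can be mimicked numerically: a minimal sketch, assuming a small set of discrete components $j$ with weights $w_j$, a Gaussian hyperprior over a scalar $\theta$, and a grid approximation of the $\theta$-integral (all densities and weights here are hypothetical toy choices):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Toy mixture decomposition of p(h|D_0) in the spirit of Eq. (1);
# all densities and weights are hypothetical choices.

w = np.array([0.3, 0.7])                  # p(j|D_0), sums to one
theta_grid = np.linspace(-3.0, 3.0, 201)  # grid for the theta-integral
dtheta = theta_grid[1] - theta_grid[0]
component_means = [-1.0, 1.0]             # one hyperprior mean per j

def p_theta_given_j(theta, j):
    """p(theta|j,D_0): Gaussian hyperprior, component-dependent mean."""
    return norm.pdf(theta, loc=component_means[j], scale=0.5)

def p_h_given_theta(h, theta):
    """p(h|theta,j,D_0): Gaussian prior for h centred at theta."""
    return norm.pdf(h, loc=theta, scale=1.0)

def prior(h):
    """p(h|D_0) = sum_j w_j int dtheta p(h|theta,j,D_0) p(theta|j,D_0)."""
    return sum(w[j] * np.sum(p_h_given_theta(h, theta_grid)
                             * p_theta_given_j(theta_grid, j)) * dtheta
               for j in range(len(w)))
\end{verbatim}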

In the following, the summation over $j$ will be treated exactly, while the $\theta$-integral will be approximated. A Bayesian approach aims at calculating the predictive density for outcomes $y$ in test situations $x$
\begin{displaymath}
p(y\vert x,D) = \int \!dh\, p(y\vert x,h)\, p(h\vert D) ,
\qquad (2)
\end{displaymath}

given data $D = \{D_T,D_0\}$ consisting of a priori data $D_0$ and i.i.d. training data $D_T = \{(x_i,y_i)\,\vert\, 1\le i\le n\}$. The vector of all $x_i$ ($y_i$) will be denoted $x_T$ ($y_T$). Fig. 1 shows a graphical representation of the considered probabilistic model.
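
A minimal numerical sketch of the predictive density (2), assuming a scalar $h$, a grid approximation of the $h$-integral, and a hypothetical stand-in for the posterior $p(h\vert D)$:

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Grid approximation of Eq. (2) for a scalar hidden variable h.
# The likelihood and the stand-in posterior are hypothetical toy choices.

h_grid = np.linspace(-5.0, 5.0, 401)
dh = h_grid[1] - h_grid[0]

def likelihood(y, x, h, sigma=0.5):
    return norm.pdf(y, loc=h * x, scale=sigma)    # p(y|x,h)

def posterior(h):
    return norm.pdf(h, loc=0.8, scale=0.2)        # stand-in for p(h|D)

def predictive(y, x):
    """p(y|x,D) = int dh p(y|x,h) p(h|D), evaluated on the h-grid."""
    return np.sum(likelihood(y, x, h_grid) * posterior(h_grid)) * dh
\end{verbatim}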

In a saddle point approximation (maximum a posteriori approximation), the $h$-integral becomes

\begin{displaymath}
p(y\vert x,D) \approx p(y\vert x,h^*) ,
\qquad (3)
\end{displaymath}


\begin{displaymath}
h^* = {\rm argmax}_{h\in{\cal H}}\, p(h\vert D) ,
\qquad (4)
\end{displaymath}

assuming $p(y\vert x,h)$ to be slowly varying at the stationary point. The posterior density is related to the ($x_T$-conditional) likelihood and the prior according to Bayes' theorem
\begin{displaymath}
p(h\vert D)
= \frac{p(y_T\vert x_T,h)\,p(h\vert D_0)}{p(y_T\vert x_T,D_0)} ,
\qquad (5)
\end{displaymath}

where the $h$-independent denominator (the evidence) can be skipped when maximising with respect to $h$. Treating the $\theta$-integral within $p(h\vert D)$ also in saddle point approximation, the posterior must be maximised with respect to $h$ and $\theta$ simultaneously.
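
The joint saddle point step of Eqs. (3)-(5) can be sketched as maximising the log posterior (log likelihood plus log prior, the evidence being dropped) simultaneously over $h$ and $\theta$. The following is a hypothetical scalar toy example, not the regression model of the next section:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Joint MAP estimate (h*, theta*) by minimising the negative log posterior;
# data, noise level and all densities are hypothetical toy choices.

x_T = np.array([0.1, 0.5, 0.9, 1.3])   # training inputs
y_T = np.array([0.2, 0.9, 1.7, 2.5])   # training outputs

def neg_log_posterior(params, sigma=0.5):
    h, theta = params
    log_lik = np.sum(norm.logpdf(y_T, loc=h * x_T, scale=sigma))  # log p(y_T|x_T,h)
    log_prior_h = norm.logpdf(h, loc=theta, scale=1.0)            # log p(h|theta,D_0)
    log_prior_theta = norm.logpdf(theta, loc=0.0, scale=2.0)      # log p(theta|D_0)
    return -(log_lik + log_prior_h + log_prior_theta)

result = minimize(neg_log_posterior, x0=np.array([0.0, 0.0]))
h_star, theta_star = result.x          # joint MAP estimate (h*, theta*)
\end{verbatim}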

Figure 1: Graphical representation of the considered probabilistic model, factorising according to $p(x_T,y_T,x,y,h,\theta,j,(\beta)\vert D) = p(x_T)\, p(x)\, p(y_T\vert x_T,h,(\beta))\, p(y\vert x,h,(\beta))\, p(h\vert\theta,j,D_0,(\beta))\, p(\theta,j,(\beta)\vert D_0)$. (The variable $\beta$ is introduced in Section 3.) Circles indicate visible variables.

