

The Bayesian model

Let us consider the following random variables:

1.
$x$, representing (a vector of) independent, visible variables (`measurement situations'),
2.
$y$, being (a vector of) dependent, visible variables (`measurement results'), and
3.
$h$, being the hidden variables (`possible states of Nature').
A Bayesian approach is based on two model inputs [1,11,4,12]:
1.
A likelihood model $p(y\vert x,h)$, describing the density of observing $y$ given $x$ and $h$. Regarded as a function of $h$, for fixed $y$ and $x$, the density $p(y\vert x,h)$ is also known as the ($x$-conditional) likelihood of $h$.
2.
A prior model $p(h\vert D_0)$, specifying the a priori density of $h$ given some a priori information denoted by $D_0$ (but before training data $D_T$ have been taken into account).
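
As a toy illustration of these two model inputs (a minimal sketch with hypothetical choices, not the model studied in this paper), one may take a scalar hidden variable $h$, a Gaussian likelihood centred at $h\,x$, and a Gaussian prior for $h$:

\begin{verbatim}
from scipy.stats import norm

# Hypothetical toy model: scalar hidden variable h, y = h*x + noise.

def likelihood(y, x, h, sigma=0.5):
    """p(y|x,h): Gaussian observation model centred at h*x."""
    return norm.pdf(y, loc=h * x, scale=sigma)

def prior(h, mu0=0.0, sigma0=1.0):
    """p(h|D_0): Gaussian a priori density for h."""
    return norm.pdf(h, loc=mu0, scale=sigma0)
\end{verbatim}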

Furthermore, to decompose a possibly complicated prior density into simpler components, we introduce continuous hyperparameters $\theta$ and discrete hyperparameters $j$ (extending the set of hidden variables to $\tilde h = (h,\theta,j)$),

\begin{displaymath}
p(h\vert D_0) = \int \!d\theta \sum_j p(h,\theta,j\vert D_0) .
\qquad (1)
\end{displaymath}
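
For illustration, the decomposition (1) can be mimicked numerically: a minimal sketch, assuming a small set of discrete components $j$ with weights $w_j$, a Gaussian hyperprior over a scalar $\theta$, and a grid approximation of the $\theta$-integral (all densities and weights here are hypothetical toy choices):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Toy mixture decomposition of p(h|D_0) in the spirit of Eq. (1);
# all densities and weights are hypothetical choices.

w = np.array([0.3, 0.7])                  # p(j|D_0), sums to one
theta_grid = np.linspace(-3.0, 3.0, 201)  # grid for the theta-integral
dtheta = theta_grid[1] - theta_grid[0]
component_means = [-1.0, 1.0]             # one hyperprior mean per j

def p_theta_given_j(theta, j):
    """p(theta|j,D_0): Gaussian hyperprior, component-dependent mean."""
    return norm.pdf(theta, loc=component_means[j], scale=0.5)

def p_h_given_theta(h, theta):
    """p(h|theta,j,D_0): Gaussian prior for h centred at theta."""
    return norm.pdf(h, loc=theta, scale=1.0)

def prior(h):
    """p(h|D_0) = sum_j w_j int dtheta p(h|theta,j,D_0) p(theta|j,D_0)."""
    return sum(w[j] * np.sum(p_h_given_theta(h, theta_grid)
                             * p_theta_given_j(theta_grid, j)) * dtheta
               for j in range(len(w)))
\end{verbatim}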

In the following, the summation over $j$ will be treated exactly, while the $\theta$-integral will be approximated. A Bayesian approach aims at calculating the predictive density for outcomes $y$ in test situations $x$
\begin{displaymath}
p(y\vert x,D) = \int \!dh\, p(y\vert x,h)\, p(h\vert D) ,
\qquad (2)
\end{displaymath}

given data $D = \{D_T,D_0\}$ consisting of a priori data $D_0$ and i.i.d. training data $D_T = \{(x_i,y_i)\,\vert\, 1\le i\le n\}$. The vector of all $x_i$ ($y_i$) will be denoted $x_T$ ($y_T$). Fig. 1 shows a graphical representation of the considered probabilistic model.
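
A minimal numerical sketch of the predictive density (2), assuming a scalar $h$, a grid approximation of the $h$-integral, and a hypothetical stand-in for the posterior $p(h\vert D)$:

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Grid approximation of Eq. (2) for a scalar hidden variable h.
# The likelihood and the stand-in posterior are hypothetical toy choices.

h_grid = np.linspace(-5.0, 5.0, 401)
dh = h_grid[1] - h_grid[0]

def likelihood(y, x, h, sigma=0.5):
    return norm.pdf(y, loc=h * x, scale=sigma)    # p(y|x,h)

def posterior(h):
    return norm.pdf(h, loc=0.8, scale=0.2)        # stand-in for p(h|D)

def predictive(y, x):
    """p(y|x,D) = int dh p(y|x,h) p(h|D), evaluated on the h-grid."""
    return np.sum(likelihood(y, x, h_grid) * posterior(h_grid)) * dh
\end{verbatim}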

In a saddle point approximation (maximum a posteriori approximation), the $h$-integral becomes

\begin{displaymath}
p(y\vert x,D) \approx p(y\vert x,h^*) ,
\qquad (3)
\end{displaymath}


\begin{displaymath}
h^* = {\rm argmax}_{h\in{\cal H}}\, p(h\vert D) ,
\qquad (4)
\end{displaymath}

assuming $p(y\vert x,h)$ to be slowly varying at the stationary point. The posterior density is related to the ($x_T$-conditional) likelihood and the prior according to Bayes' theorem
\begin{displaymath}
p(h\vert D)
= \frac{p(y_T\vert x_T,h)\,p(h\vert D_0)}{p(y_T\vert x_T,D_0)} ,
\qquad (5)
\end{displaymath}

where the $h$-independent denominator (the evidence) can be skipped when maximising with respect to $h$. Treating the $\theta$-integral within $p(h\vert D)$ also in saddle point approximation, the posterior must be maximised with respect to $h$ and $\theta$ simultaneously.
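
The joint saddle point step of Eqs. (3)-(5) can be sketched as maximising the log posterior (log likelihood plus log prior, the evidence being dropped) simultaneously over $h$ and $\theta$. The following is a hypothetical scalar toy example, not the regression model of the next section:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Joint MAP estimate (h*, theta*) by minimising the negative log posterior;
# data, noise level and all densities are hypothetical toy choices.

x_T = np.array([0.1, 0.5, 0.9, 1.3])   # training inputs
y_T = np.array([0.2, 0.9, 1.7, 2.5])   # training outputs

def neg_log_posterior(params, sigma=0.5):
    h, theta = params
    log_lik = np.sum(norm.logpdf(y_T, loc=h * x_T, scale=sigma))  # log p(y_T|x_T,h)
    log_prior_h = norm.logpdf(h, loc=theta, scale=1.0)            # log p(h|theta,D_0)
    log_prior_theta = norm.logpdf(theta, loc=0.0, scale=2.0)      # log p(theta|D_0)
    return -(log_lik + log_prior_h + log_prior_theta)

result = minimize(neg_log_posterior, x0=np.array([0.0, 0.0]))
h_star, theta_star = result.x          # joint MAP estimate (h*, theta*)
\end{verbatim}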

Figure 1: Graphical representation of the considered probabilistic model, factorising according to $p(x_T,y_T,x,y,h,\theta,j,(\beta)\vert D) = p(x_T)\, p(x)\, p(y_T\vert x_T,h,(\beta))\, p(y\vert x,h,(\beta))\, p(h\vert\theta,j,D_0,(\beta))\, p(\theta,j,(\beta)\vert D_0)$. (The variable $\beta$ is introduced in Section 3.) Circles indicate visible variables.

