Posterior and likelihood

Bayesian approaches require the calculation of posterior densities. Model states ${h}$ are commonly specified by giving the data-generating probabilities, or likelihoods, $p(y\vert x,{h})$. Posteriors are linked to likelihoods by Bayes' theorem

\begin{displaymath}
p(A\vert B)=\frac{p(B\vert A)p(A)}{p(B)}
,
\end{displaymath} (14)

which follows at once from the definition of conditional probabilities, i.e., $p(A,B)$ = $p(A\vert B)p(B)$ = $p(B\vert A)p(A)$. Thus, one finds
\begin{displaymath}
p({h}\vert f)
=
p({h}\vert D,D_0)
=
\frac{p(D\vert{h}) \, p({h}\vert D_0)}{p(D\vert D_0)}
=
\frac{p(y_D\vert x_D,{h}) \, p({h}\vert D_0)}{p(y_D\vert x_D,D_0)}
\end{displaymath} (15)


\begin{displaymath}
= \frac{p({h}\vert D_0)\prod_i p(x_i, y_i\vert{h}) }
{\int \!d{h}\, p({h}\vert D_0) \prod_i p(x_i, y_i\vert{h}) }
= \frac{p({h}\vert D_0)\prod_i p(y_i\vert x_i,{h}) }
{\int \!d{h}\, p({h}\vert D_0) \prod_i p(y_i\vert x_i,{h}) }
,
\end{displaymath} (16)

using $p(y_D\vert x_D,D_0,{h})$ = $p(y_D\vert x_D,{h})$ for the training data likelihood of ${h}$ and $p({h}\vert D_0,x_i)$ = $p({h}\vert D_0)$, i.e., assuming the prior over model states to be independent of the training inputs. In a Bayesian context the terms of Eq. (15) are often referred to as
\begin{displaymath}
{\rm posterior} = \frac{{\rm likelihood} \times {\rm prior}}{{\rm evidence}}
.
\end{displaymath} (17)

Eqs. (16) show that the posterior can be expressed equivalently through the joint likelihoods $p(y_i,x_i\vert{h})$ or the conditional likelihoods $p(y_i\vert x_i,{h})$. When working with joint likelihoods, a distinction between $y$ and $x$ variables is not necessary; in that case $x$ can be included in $y$ and dropped from the notation. If, however, $p(x)$ is already known or is not of interest, working with conditional likelihoods is preferable. Eqs. (15,16) can be interpreted as an updating (or learning) formula used to obtain a new posterior from a given prior probability when new data $D$ arrive.
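The updating interpretation can be illustrated by a minimal numerical sketch (the discrete model states and data below are hypothetical, chosen only for illustration): the posterior after each datum serves as the prior for the next, with the evidence supplying the normalization.

```python
# Hypothetical toy example: model states h are candidate Bernoulli
# parameters for p(y=1|h); the input x plays no role in this sketch.
states = [0.2, 0.5, 0.8]     # candidate values of p(y=1|h)
prior = [1/3, 1/3, 1/3]      # p(h|D_0): uniform prior

def update(prior, y):
    """One Bayes step: posterior = likelihood * prior / evidence, Eq. (17)."""
    likelihood = [h if y == 1 else 1 - h for h in states]
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)            # normalization p(y | previous data)
    return [j / evidence for j in joint]

# Sequential updating: each posterior becomes the prior for the next datum.
posterior = prior
for y in [1, 1, 0, 1]:
    posterior = update(posterior, y)

print(posterior)   # probability mass shifts toward the state h = 0.8
```

Because every step divides by the evidence, feeding the data one at a time or all at once yields the same final posterior.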

In terms of energies Eq. (16) reads,

\begin{displaymath}
p({h}\vert f)
=
\frac{e^{-\beta \sum_i E(y_i\vert x_i,{h}) - \beta E({h}\vert D_0)}}
{Z(Y_D\vert x_D,{h})\, Z({H}\vert D_0)}
\left( \int \!d{h^\prime}\,
\frac{e^{-\beta \sum_i E(y_i\vert x_i,{h^\prime}) - \beta E({h^\prime}\vert D_0)}}
{Z(Y_D\vert x_D,{h^\prime})\, Z({H}\vert D_0)}\right)^{-1},
\end{displaymath} (18)

where the same temperature $1/\beta$ has been chosen for both energies, and the normalization constants are
\begin{displaymath}
Z(Y_D\vert x_D,{h}) = \prod_i \int \!dy_i\, e^{-\beta E(y_i\vert x_i,{h}) }
,
\end{displaymath} (19)

\begin{displaymath}
Z({H}\vert D_0) = \int \!d{h}\, e^{- \beta E({h}\vert D_0) }
.
\end{displaymath} (20)
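As a numerical sketch of a single factor of Eq. (19), consider a hypothetical Gaussian likelihood energy $E(y\vert x,{h}) = (y-h(x))^2/2$ at one fixed input $x$ (the values of $\beta$ and $h(x)$ below are arbitrary choices, not from the text); the analytic value of the resulting Gaussian integral is $\sqrt{2\pi/\beta}$.

```python
import math

beta = 1.0       # inverse temperature, an arbitrary choice for this sketch
h_at_x = 0.3     # hypothetical regression value h(x) at one fixed input x

def E(y):
    # Gaussian likelihood energy E(y|x,h) = (y - h(x))^2 / 2
    return 0.5 * (y - h_at_x) ** 2

# Z(Y|x,h) = integral dy exp(-beta E(y|x,h)), here by a simple Riemann sum;
# the analytic value for this Gaussian energy is sqrt(2*pi/beta).
n, span = 200000, 10.0
dy = 2 * span / n
Z = dy * sum(math.exp(-beta * E(-span + i * dy)) for i in range(n + 1))

# The associated free energy, F = -(1/beta) ln Z
F = -math.log(Z) / beta

print(Z, F)   # Z is close to sqrt(2*pi) for beta = 1
```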

The predictive density we are interested in can be written as the ratio of two correlation functions under $p_0 ({h})$,

\begin{displaymath}
p(y\vert x,f)
= \,<p(y\vert x,{h})>_{{H}\vert f}
\end{displaymath} (21)

\begin{displaymath}
= \frac{<p(y\vert x,{h}) \prod_i p(y_i\vert x_i,{h})>_{{H}\vert D_0} }
{<\prod_i p(y_i\vert x_i,{h})>_{{H}\vert D_0} }
\end{displaymath} (22)

\begin{displaymath}
= \frac{\int \!d{h}\, p(y\vert x,{h}) \, e^{-\beta E_{\rm comb}}}
{\int \!d{h}\, e^{-\beta E_{\rm comb}}}
,
\end{displaymath} (23)

where $< \cdots >_{{H}\vert D_0}$ denotes the expectation under the prior density $p_0 ({h})$ = $p({h}\vert D_0)$, and the combined likelihood and prior energy $E_{\rm comb}$ collects the ${h}$-dependent energy and free energy terms
\begin{displaymath}
E_{\rm comb} =
\sum_i E(y_i\vert x_i,{h}) + E({h}\vert D_0) - F(Y_D\vert x_D,{h}),
\end{displaymath} (24)

with
\begin{displaymath}
F(Y_D\vert x_D,{h})
= -\frac{1}{\beta} \ln Z(Y_D\vert x_D,{h})
.
\end{displaymath} (25)

In going from Eq. (22) to Eq. (23), the normalization factor $Z({H}\vert D_0)$, which appears in both numerator and denominator, has been canceled.
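The ratio form of Eq. (22) lends itself to a simple Monte Carlo sketch: sample ${h}$ from the prior and weight both expectations by the training likelihood $\prod_i p(y_i\vert x_i,{h})$, so that the prior normalization cancels automatically. The scalar Gaussian prior and likelihood below are hypothetical choices for illustration, not taken from the text.

```python
import math
import random

random.seed(1)

# Hypothetical toy model: scalar state h with Gaussian prior p(h|D_0) = N(0,1)
# and Gaussian likelihood p(y|x,h) = N(h,1); x plays no role in this sketch.
def gauss(y, mean, var=1.0):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

train_y = [0.5, 1.0, 1.5]   # training outputs y_i

# Sample h from the prior; the training likelihood prod_i p(y_i|x_i,h)
# serves as the weight in both expectations of Eq. (22).
hs = [random.gauss(0.0, 1.0) for _ in range(100000)]
ws = [math.prod(gauss(yi, h) for yi in train_y) for h in hs]

def predictive(y):
    num = sum(gauss(y, h) * w for h, w in zip(hs, ws))
    return num / sum(ws)

print(predictive(1.0))   # analytic value is N(1.0; 0.75, 1.25), about 0.35
```

For this conjugate toy model the exact predictive density is Gaussian with mean $3\bar{y}/4 = 0.75$ and variance $1 + 1/4$, which the Monte Carlo ratio reproduces.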

We remark that for continuous $x$ and/or $y$ the likelihood energy term $E(y_i\vert x_i,{h})$ describes an idealized, unrealistic measurement, because realistic measurements cannot be arbitrarily sharp. Considering the function $p(\cdot\vert\cdot,{h})$ as an element of a Hilbert space, its values may be written as the scalar product $p(y\vert x,{h})$ = $(v_{xy},\, p(\cdot\vert\cdot,{h})\,)$, where the function $v_{xy}$ is also an element of that Hilbert space. For continuous $x$ and/or $y$ this notation is only formal, as $v_{xy}$ becomes unnormalizable. In practice a measurement of $p(\cdot\vert\cdot,{h})$ corresponds to a normalizable $v_{\tilde x\tilde y}$ = $\int \!dy \int \!dx \,\vartheta (x,y) v_{xy}$, where the kernel $\vartheta (x,y)$ has to ensure normalizability. (Choosing normalizable $v_{\tilde x\tilde y}$ as coordinates, the Hilbert space of $p(\cdot\vert\cdot,{h})$ is also called a reproducing kernel Hilbert space [183,112,113,228,144].) The data terms then become

\begin{displaymath}
p(\tilde y_i\vert\tilde x_i,{h})
=\frac{\int \!dy \, dx \,\vartheta (\tilde x_i, \tilde y_i;x,y)\, p(y,x\vert{h})}
{\int \!d\tilde y \, dy \, dx\,\vartheta (\tilde x_i, \tilde y;x,y)\, p(y,x\vert{h})}
.
\end{displaymath} (26)

The notation $p(y_i\vert x_i,{h})$ is understood as the limit $\vartheta (x,y)\rightarrow
\delta (x-x_i)\delta (y-y_i)$ and means in practice that $\vartheta (x,y)$ is very sharply centered. We will assume that the discretization, which is eventually necessary for numerical calculations, implements such an averaging.
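The sharply centered limit can be checked numerically. In the sketch below, a normalized Gaussian kernel stands in for $\vartheta$ and a one-dimensional Gaussian density stands in for $p(\cdot\vert\cdot,{h})$ (with $x$ suppressed); both are hypothetical choices. As the kernel width shrinks, the smeared value approaches the point value.

```python
import math

# Hypothetical smooth model density p(y|h): a standard normal in y.
def p(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

def smeared(y0, width, n=4001, span=10.0):
    """Kernel average: integral of vartheta(y) p(y) dy with a normalized
    Gaussian kernel vartheta centered at y0, by a simple Riemann sum."""
    dy = 2 * span / (n - 1)
    num = den = 0.0
    for i in range(n):
        y = -span + i * dy
        k = math.exp(-0.5 * ((y - y0) / width) ** 2)
        num += k * p(y) * dy
        den += k * dy            # normalizes the kernel
    return num / den

for w in [1.0, 0.1, 0.01]:
    print(w, smeared(0.5, w))    # approaches p(0.5) as the width shrinks
```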


Joerg_Lemm 2001-01-21