Posterior and likelihood

Bayesian approaches require the calculation of posterior densities. Model states ${h}$ are commonly specified by giving the data-generating probabilities, or likelihoods, $p(y\vert x,{h})$. Posteriors are linked to likelihoods by Bayes' theorem

\begin{displaymath}
p(A\vert B)=\frac{p(B\vert A)p(A)}{p(B)}
,
\end{displaymath} (14)

which follows at once from the definition of conditional probabilities, i.e., $p(A,B)$ = $p(A\vert B)p(B)$ = $p(B\vert A)p(A)$. Thus, one finds
\begin{displaymath}
p({h}\vert f)
=
p({h}\vert D,D_0)
=
\frac{p(D\vert{h}) \, p({h}\vert D_0)}{p(D\vert D_0)}
=
\frac{p(y_D\vert x_D,{h}) \, p({h}\vert D_0)}{p(y_D\vert x_D,D_0)}
\end{displaymath} (15)


\begin{displaymath}
= \frac{p({h}\vert D_0)\prod_i p(x_i, y_i\vert{h}) }
{\int \!d{h}\, p({h}\vert D_0) \prod_i p(x_i, y_i\vert{h}) }
= \frac{p({h}\vert D_0)\prod_i p(y_i\vert x_i,{h}) }
{\int \!d{h}\, p({h}\vert D_0) \prod_i p(y_i\vert x_i,{h}) }
,
\end{displaymath} (16)

using $p(y_D\vert x_D,D_0,{h})$ = $p(y_D\vert x_D,{h})$ for the training data likelihood of ${h}$ and $p({h}\vert D_0,x_i)$ = $p({h}\vert D_0)$, i.e., assuming the prior over model states to be independent of the training inputs. In a Bayesian context the terms of Eq. (15) are often referred to as
\begin{displaymath}
{\rm posterior} = \frac{{\rm likelihood} \times {\rm prior}}{{\rm evidence}}
.
\end{displaymath} (17)

Eqs. (16) show that the posterior can be expressed equivalently through the joint likelihoods $p(y_i,x_i\vert{h})$ or the conditional likelihoods $p(y_i\vert x_i,{h})$. When working with joint likelihoods, a distinction between $y$ and $x$ variables is not necessary; in that case $x$ can be included in $y$ and dropped from the notation. If, however, $p(x)$ is already known or is not of interest, working with conditional likelihoods is preferable. Eqs. (15,16) can be interpreted as an updating (or learning) formula used to obtain a new posterior from a given prior probability when new data $D$ arrive.
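The updating interpretation can be illustrated by a minimal numerical sketch (the discrete model states and data below are hypothetical, chosen only for illustration): the posterior after each datum serves as the prior for the next, with the evidence supplying the normalization.

```python
# Hypothetical toy example: model states h are candidate Bernoulli
# parameters for p(y=1|h); the input x plays no role in this sketch.
states = [0.2, 0.5, 0.8]     # candidate values of p(y=1|h)
prior = [1/3, 1/3, 1/3]      # p(h|D_0): uniform prior

def update(prior, y):
    """One Bayes step: posterior = likelihood * prior / evidence, Eq. (17)."""
    likelihood = [h if y == 1 else 1 - h for h in states]
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)            # normalization p(y | previous data)
    return [j / evidence for j in joint]

# Sequential updating: each posterior becomes the prior for the next datum.
posterior = prior
for y in [1, 1, 0, 1]:
    posterior = update(posterior, y)

print(posterior)   # probability mass shifts toward the state h = 0.8
```

Because every step divides by the evidence, feeding the data one at a time or all at once yields the same final posterior.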

In terms of energies Eq. (16) reads,

\begin{displaymath}
p({h}\vert f)
=
\frac{e^{-\beta \sum_i E(y_i\vert x_i,{h}) - \beta E({h}\vert D_0)}}
{Z(Y_D\vert x_D,{h})\, Z({H}\vert D_0)}
\left( \int \!d{h^\prime}\,
\frac{e^{-\beta \sum_i E(y_i\vert x_i,{h^\prime}) - \beta E({h^\prime}\vert D_0)}}
{Z(Y_D\vert x_D,{h^\prime})\, Z({H}\vert D_0)}\right)^{-1},
\end{displaymath} (18)

where the same temperature $1/\beta$ has been chosen for both energies, and the normalization constants are
\begin{displaymath}
Z(Y_D\vert x_D,{h}) = \prod_i \int \!dy_i\, e^{-\beta E(y_i\vert x_i,{h}) }
,
\end{displaymath} (19)

\begin{displaymath}
Z({H}\vert D_0) = \int \!d{h}\, e^{- \beta E({h}\vert D_0) }
.
\end{displaymath} (20)
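As a numerical sketch of a single factor of Eq. (19), consider a hypothetical Gaussian likelihood energy $E(y\vert x,{h}) = (y-h(x))^2/2$ at one fixed input $x$ (the values of $\beta$ and $h(x)$ below are arbitrary choices, not from the text); the analytic value of the resulting Gaussian integral is $\sqrt{2\pi/\beta}$.

```python
import math

beta = 1.0       # inverse temperature, an arbitrary choice for this sketch
h_at_x = 0.3     # hypothetical regression value h(x) at one fixed input x

def E(y):
    # Gaussian likelihood energy E(y|x,h) = (y - h(x))^2 / 2
    return 0.5 * (y - h_at_x) ** 2

# Z(Y|x,h) = integral dy exp(-beta E(y|x,h)), here by a simple Riemann sum;
# the analytic value for this Gaussian energy is sqrt(2*pi/beta).
n, span = 200000, 10.0
dy = 2 * span / n
Z = dy * sum(math.exp(-beta * E(-span + i * dy)) for i in range(n + 1))

# The associated free energy, F = -(1/beta) ln Z
F = -math.log(Z) / beta

print(Z, F)   # Z is close to sqrt(2*pi) for beta = 1
```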

The predictive density we are interested in can be written as the ratio of two correlation functions under $p_0 ({h})$,

\begin{displaymath}
p(y\vert x,f)
= \,<p(y\vert x,{h})>_{{H}\vert f}
\end{displaymath} (21)

\begin{displaymath}
= \frac{<p(y\vert x,{h}) \prod_i p(y_i\vert x_i,{h})>_{{H}\vert D_0} }
{<\prod_i p(y_i\vert x_i,{h})>_{{H}\vert D_0} }
\end{displaymath} (22)

\begin{displaymath}
= \frac{\int \!d{h}\, p(y\vert x,{h}) \, e^{-\beta E_{\rm comb}}}
{\int \!d{h}\, e^{-\beta E_{\rm comb}}}
,
\end{displaymath} (23)

where $< \cdots >_{{H}\vert D_0}$ denotes the expectation under the prior density $p_0 ({h})$ = $p({h}\vert D_0)$, and the combined likelihood and prior energy $E_{\rm comb}$ collects the ${h}$-dependent energy and free energy terms
\begin{displaymath}
E_{\rm comb} =
\sum_i E(y_i\vert x_i,{h}) + E({h}\vert D_0) - F(Y_D\vert x_D,{h}),
\end{displaymath} (24)

with
\begin{displaymath}
F(Y_D\vert x_D,{h})
= -\frac{1}{\beta} \ln Z(Y_D\vert x_D,{h})
.
\end{displaymath} (25)

In going from Eq. (22) to Eq. (23), the normalization factor $Z({H}\vert D_0)$, which appears in both numerator and denominator, has been canceled.
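The ratio form of Eq. (22) lends itself to a simple Monte Carlo sketch: sample ${h}$ from the prior and weight both expectations by the training likelihood $\prod_i p(y_i\vert x_i,{h})$, so that the prior normalization cancels automatically. The scalar Gaussian prior and likelihood below are hypothetical choices for illustration, not taken from the text.

```python
import math
import random

random.seed(1)

# Hypothetical toy model: scalar state h with Gaussian prior p(h|D_0) = N(0,1)
# and Gaussian likelihood p(y|x,h) = N(h,1); x plays no role in this sketch.
def gauss(y, mean, var=1.0):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

train_y = [0.5, 1.0, 1.5]   # training outputs y_i

# Sample h from the prior; the training likelihood prod_i p(y_i|x_i,h)
# serves as the weight in both expectations of Eq. (22).
hs = [random.gauss(0.0, 1.0) for _ in range(100000)]
ws = [math.prod(gauss(yi, h) for yi in train_y) for h in hs]

def predictive(y):
    num = sum(gauss(y, h) * w for h, w in zip(hs, ws))
    return num / sum(ws)

print(predictive(1.0))   # analytic value is N(1.0; 0.75, 1.25), about 0.35
```

For this conjugate toy model the exact predictive density is Gaussian with mean $3\bar{y}/4 = 0.75$ and variance $1 + 1/4$, which the Monte Carlo ratio reproduces.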

We remark that for continuous $x$ and/or $y$ the likelihood energy term $E(y_i\vert x_i,{h})$ describes an idealized, unrealistic measurement, because realistic measurements cannot be arbitrarily sharp. Considering the function $p(\cdot\vert\cdot,{h})$ as an element of a Hilbert space, its values may be written as the scalar product $p(y\vert x,{h})$ = $(v_{xy},\, p(\cdot\vert\cdot,{h})\,)$, where the function $v_{xy}$ is also an element of that Hilbert space. For continuous $x$ and/or $y$ this notation is only formal, as $v_{xy}$ becomes unnormalizable. In practice a measurement of $p(\cdot\vert\cdot,{h})$ corresponds to a normalizable $v_{\tilde x\tilde y}$ = $\int \!dy \int \!dx \,\vartheta (x,y) v_{xy}$, where the kernel $\vartheta (x,y)$ has to ensure normalizability. (Choosing normalizable $v_{\tilde x\tilde y}$ as coordinates, the Hilbert space of $p(\cdot\vert\cdot,{h})$ is also called a reproducing kernel Hilbert space [183,112,113,228,144].) The data terms then become

\begin{displaymath}
p(\tilde y_i\vert\tilde x_i,{h})
=\frac{\int \!dy \, dx \,\vartheta (\tilde x_i, \tilde y_i;x,y)\, p(y,x\vert{h})}
{\int \!d\tilde y \, dy \, dx\,\vartheta (\tilde x_i, \tilde y;x,y)\, p(y,x\vert{h})}
.
\end{displaymath} (26)

The notation $p(y_i\vert x_i,{h})$ is understood as the limit $\vartheta (x,y)\rightarrow
\delta (x-x_i)\delta (y-y_i)$ and means in practice that $\vartheta (x,y)$ is very sharply centered. We will assume that the discretization, which is eventually necessary for numerical calculations, implements such an averaging.
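The sharply centered limit can be checked numerically. In the sketch below, a normalized Gaussian kernel stands in for $\vartheta$ and a one-dimensional Gaussian density stands in for $p(\cdot\vert\cdot,{h})$ (with $x$ suppressed); both are hypothetical choices. As the kernel width shrinks, the smeared value approaches the point value.

```python
import math

# Hypothetical smooth model density p(y|h): a standard normal in y.
def p(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

def smeared(y0, width, n=4001, span=10.0):
    """Kernel average: integral of vartheta(y) p(y) dy with a normalized
    Gaussian kernel vartheta centered at y0, by a simple Riemann sum."""
    dy = 2 * span / (n - 1)
    num = den = 0.0
    for i in range(n):
        y = -span + i * dy
        k = math.exp(-0.5 * ((y - y0) / width) ** 2)
        num += k * p(y) * dy
        den += k * dy            # normalizes the kernel
    return num / den

for w in [1.0, 0.1, 0.01]:
    print(w, smeared(0.5, w))    # approaches p(0.5) as the width shrinks
```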


Joerg_Lemm 2001-01-21