

Mutual information and learning

The aim of learning is to generalize the information obtained from training data to non-training situations. For such a generalization to be possible, there must exist a relation, at least partially known, between the likelihoods $p(y_i\vert x_i,h)$ for training and for non-training data. This relation is typically provided by a priori knowledge.

One way to quantify the relation between two random variables $y_1$ and $y_2$, representing for example training and non-training data, is to calculate their mutual information, defined as

\begin{displaymath}
M(Y_1,Y_2)
=
\sum_{y_1\in Y_1,y_2\in Y_2} p(y_1,y_2)\ln \frac{p(y_1,y_2)}{p(y_1) p(y_2)}
.
\end{displaymath} (28)

It is also instructive to express the mutual information in terms of (average) information content or entropy, which, for a probability function $p(y)$, is defined as
\begin{displaymath}
H(Y) = - \sum_{y\in Y} p(y)\ln p(y)
.
\end{displaymath} (29)

We find
\begin{displaymath}
M(Y_1,Y_2)
= H(Y_1)+H(Y_2)-H(Y_1,Y_2)
,
\end{displaymath} (30)

meaning that the mutual information is the sum of the two individual entropies diminished by their joint entropy; it vanishes if and only if the two variables are independent.
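As a simple numerical illustration of Eqs. (28)-(30), the following minimal sketch computes the mutual information of a discrete joint distribution both directly and via the entropy decomposition; the probability table is an arbitrary assumption chosen for this example, not taken from the text.

\begin{verbatim}
import numpy as np

# Illustrative joint distribution p(y1,y2) on a binary alphabet
# (the numbers are an assumption for this example).
p_joint = np.array([[0.30, 0.10],
                    [0.10, 0.50]])
p1 = p_joint.sum(axis=1)   # marginal p(y1)
p2 = p_joint.sum(axis=0)   # marginal p(y2)

def entropy(p):
    """Entropy in nats, Eq. (29)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Mutual information directly from Eq. (28) ...
M_direct = sum(p_joint[i, j]
               * np.log(p_joint[i, j] / (p1[i] * p2[j]))
               for i in range(2) for j in range(2)
               if p_joint[i, j] > 0)

# ... and via the entropy decomposition of Eq. (30).
M_entropy = entropy(p1) + entropy(p2) - entropy(p_joint.ravel())

print(M_direct, M_entropy)  # both approx. 0.178 nats
\end{verbatim}

For an independent pair, $p(y_1,y_2)$ = $p(y_1)p(y_2)$, both expressions give zero.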

To have a compact notation for a family of predictive densities $p(y_i\vert x_i,f)$, we choose a vector $x$ = $(x_1,x_2,\cdots )$ consisting of all possible values $x_i$ and a corresponding vector $y$ = $(y_1,y_2,\cdots )$, so we can write

\begin{displaymath}
p(y\vert x,f)
=
p(y_1,y_2,\cdots \vert x_1,x_2,\cdots ,f).
\end{displaymath} (31)

We would now like to characterize a state of knowledge $f$, corresponding to the predictive density $p(y\vert x,f)$, by its mutual information. Thus, we generalize the definition (28) from two random variables to a random vector $y$ with components $y_i$, given a vector $x$ with components $x_i$, and obtain the conditional mutual information
\begin{displaymath}
M(Y\vert x,f) =
\int \left(\prod_i \,dy_i\right)
p(y\vert x,f) \ln \frac{p(y\vert x,f)}{\prod_j p(y_j\vert x_j,f)}
,
\end{displaymath} (32)

or
\begin{displaymath}
M(Y\vert x,f) =
\sum_i H(Y_i\vert x_i,f) - H(Y\vert x,f)
,
\end{displaymath} (33)

in terms of conditional entropies of the form
\begin{displaymath}
H(Y\vert x,f) = -
\int \!dy\, p(y\vert x,f) \ln p(y\vert x,f)
,
\end{displaymath} (34)

where the marginal entropies $H(Y_i\vert x_i,f)$ are defined analogously, with $p(y\vert x,f)$ replaced by $p(y_i\vert x_i,f)$.

If, instead of a fixed vector $x$ = $(x_1,x_2,\cdots )$, a density $p(x)$ is given, it is useful to average the conditional mutual information and conditional entropy by including the integral $\int \!dx\, p(x)$ in the above formulae.
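The following sketch evaluates the conditional mutual information (32) by integration on a grid, for an illustrative two-point example with two $x$ values and Gaussian components with means $\pm 1$; the grid range and $\sigma$ = $1$ are assumptions made for this example.

\begin{verbatim}
import numpy as np

y = np.linspace(-6.0, 6.0, 401)   # integration grid for y_a and y_b
dy = y[1] - y[0]
sigma = 1.0

def gauss(y, mean):
    """Gaussian density with the given mean and width sigma."""
    return (np.exp(-(y - mean)**2 / (2 * sigma**2))
            / np.sqrt(2 * np.pi * sigma**2))

def mutual_information(p_joint):
    """Conditional mutual information, Eq. (32), on the grid."""
    pa = p_joint.sum(axis=1) * dy   # marginal p(y_a|x_a,f)
    pb = p_joint.sum(axis=0) * dy   # marginal p(y_b|x_b,f)
    mask = p_joint > 1e-300
    ratio = p_joint[mask] / np.outer(pa, pb)[mask]
    return np.sum(p_joint[mask] * np.log(ratio)) * dy**2

Ya, Yb = np.meshgrid(y, y, indexing="ij")

# Factorial state: a product density has zero mutual information.
p_fact = gauss(Ya, 1.0) * gauss(Yb, 1.0)
# Non-factorial state: mixture of two product densities with
# correlated means (+1,+1) and (-1,-1).
p_mix = 0.5 * (gauss(Ya, 1.0) * gauss(Yb, 1.0)
               + gauss(Ya, -1.0) * gauss(Yb, -1.0))

print(mutual_information(p_fact))  # approx. 0
print(mutual_information(p_mix))   # positive
\end{verbatim}

This numerically anticipates the statement of Eq. (35) below: the factorial density yields zero mutual information, while the mixture does not.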

It is clear from Eq. (32) that predictive densities which factorize

\begin{displaymath}
p(y\vert x,f)
=
\prod_i p(y_i\vert x_i,f)
,
\end{displaymath} (35)

have a mutual information of zero. Hence, such factorial states do not allow any generalization from training to non-training data. A special example is given by the possible states of Nature, or pure states, $h$, which factorize according to the definition of our model,
\begin{displaymath}
p(y\vert x,h)
=
\prod_i p(y_i\vert x_i,h)
.
\end{displaymath} (36)

Thus, pure states do not allow any further generalization. This is consistent with the fact that pure states represent the natural endpoints of any learning process.

It is interesting to see, however, that there are also other states for which the predictive density factorizes. Indeed, from Eq. (36) it follows that any (prior or posterior) probability $p(h)$ which factorizes leads to a factorial state,

\begin{displaymath}
p(h)
= \prod_i p(h(x_i))
\Rightarrow
p(y\vert x,f)
=
\prod_i p(y_i\vert x_i,f)
.
\end{displaymath} (37)
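The implication (37) can be verified directly: by Eq. (36) each factor of the likelihood depends on $h$ only through the local value $h(x_i)$, so for a factorial $p(h)$ the $h$-integral factorizes,
\begin{displaymath}
p(y\vert x,f)
= \int \!dh\, p(h) \prod_i p(y_i\vert x_i,h)
= \prod_i \int \!dh(x_i)\, p(h(x_i))\, p(y_i\vert x_i,h(x_i))
= \prod_i p(y_i\vert x_i,f)
.
\end{displaymath}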

This means that generalization, i.e., (non-local) learning, is impossible when starting from a factorial prior.

A factorial prior provides a very clear reference point for analyzing the role of a priori information in learning. In particular, with respect to a prior factorial in the local variables $x_i$, learning may be decomposed into two steps, one increasing and the other lowering the mutual information:

1.
Starting from a factorial prior, new non-local data $D_0$ (typically called a priori information) produce a new non-factorial state with non-zero mutual information.
2.
Local data $D$ (typically called training data) stochastically reduce the mutual information. Hence, learning with local data corresponds to a stochastic decay of mutual information.

Pure states, i.e., the extremal points in the space of possible predictive densities, do not have to be deterministic. By improving measurement devices, stochastic pure states may be further decomposed into finer components $g$, so that

\begin{displaymath}
p(y_i\vert x_i,h) = \int\! dg\, p(g) \, p(y_i\vert x_i,g)
.
\end{displaymath} (38)

Imposing a non-factorial prior $p(g)$ on the new, finer hypotheses $g$ again enables non-local learning with local data, leading asymptotically to one of the new pure states $p(y_i\vert x_i,g)$.

Let us exemplify the stochastic decay of mutual information by a simple numerical example. Because the mutual information requires integration over all $y_i$ variables, we choose a problem with only two of them, $y_a$ and $y_b$, corresponding to two $x$ values, $x_a$ and $x_b$. We consider a model with four states of Nature $h_l$, $1\le l\le 4$, with Gaussian likelihood $p(y\vert x,h_l)$ = $(\sqrt{2 \pi} \sigma)^{-1} \exp{\left(-(y- h_l(x))^2/(2\sigma^2)\right)}$ and local means $h_l(x_j)$ = $\pm 1$.

Selecting a ``true'' state of Nature $h$, we sample 50 data points $D_i$ = $(x_i,y_i)$ from the corresponding Gaussian likelihood, using $p(x_a)$ = $p(x_b)$ = $0.5$. Then, starting from a given prior $p(h\vert D_0)$, factorial or non-factorial, we sequentially update the predictive density,

\begin{displaymath}
p(y\vert x,f(D_{i+1},\cdots,D_{0})) =
\sum_{l=1}^4 p(y\vert x,h_l)\, p(h_l\vert D_{i+1},\cdots,D_{0})
,
\end{displaymath} (39)

by calculating the posterior
\begin{displaymath}
p(h_l\vert D_{i+1},\cdots,D_{0})
= \frac{p(y_{i+1}\vert x_{i+1},h_l)\, p(h_l\vert D_{i},\cdots,D_{0})}
{p(y_{i+1}\vert x_{i+1},D_{i},\cdots,D_{0})}
.
\end{displaymath} (40)

It is easily seen from Eq. (40) that factorial states remain factorial under local data.
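The following minimal sketch implements Eqs. (39) and (40) for this model; the choice of the true state, the prior weights, the random seed, and $\sigma$ = $1$ are illustrative assumptions, not specifications from the text.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
sigma, n_data = 1.0, 50

# Four states of Nature h_l: all sign combinations of the local
# means +-1 at the two inputs x_a (index 0) and x_b (index 1).
H = np.array([[+1., +1.], [+1., -1.], [-1., +1.], [-1., -1.]])

def lik(y, x_idx):
    """Gaussian likelihood p(y|x,h_l), for all four l at once."""
    y = np.asarray(y, dtype=float)
    return (np.exp(-(y[..., None] - H[:, x_idx])**2 / (2 * sigma**2))
            / np.sqrt(2 * np.pi * sigma**2))

# Sample local training data from the "true" state (index 0),
# with p(x_a) = p(x_b) = 0.5.
true_l = 0
xs = rng.integers(0, 2, size=n_data)
ys = H[true_l, xs] + sigma * rng.standard_normal(n_data)

# Non-factorial prior p(h|D_0), favoring the two correlated states.
posterior = np.array([0.45, 0.05, 0.05, 0.45])

y_grid = np.linspace(-6.0, 6.0, 301)   # grid for Eq. (32)
dy = y_grid[1] - y_grid[0]

def mutual_info(post):
    """M(Y|x,f) of the predictive density, Eqs. (32) and (39)."""
    La, Lb = lik(y_grid, 0), lik(y_grid, 1)
    joint = np.einsum('l,il,jl->ij', post, La, Lb)
    pa, pb = joint.sum(axis=1) * dy, joint.sum(axis=0) * dy
    mask = joint > 1e-300
    ratio = joint[mask] / np.outer(pa, pb)[mask]
    return np.sum(joint[mask] * np.log(ratio)) * dy**2

# Sequential Bayesian updating, Eq. (40): multiply the previous
# posterior by the local likelihood and renormalize.
for i, (x_i, y_i) in enumerate(zip(xs, ys)):
    posterior = posterior * lik(y_i, x_i)
    posterior = posterior / posterior.sum()
    if (i + 1) % 10 == 0:
        print(f"n={i+1:2d}  p(h_l|D)={posterior.round(3)}"
              f"  M={mutual_info(posterior):.4f}")
\end{verbatim}

Replacing the prior by the factorial (here uniform) one, $(0.25,0.25,0.25,0.25)$, reproduces the behavior of panels ($e$, $f$) of Fig. 3: the mutual information remains zero up to numerical noise.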

Fig. 3 shows that the mutual information indeed decays rapidly. Depending on the training data, a wrong hypothesis $h_l$ may nevertheless dominate while the mutual information decays. Having arrived at a factorial state, further learning has to be local: data points for $x_i$ can then only influence the predictive density for the corresponding $y_i$ and do not allow generalization to the other $y_j$ with $j\ne i$.

For a factorial prior $p(h_l)$ = $p(h_l(x_a))p(h_l(x_b))$, learning is local from the very beginning. Only very small random numerical fluctuations of the mutual information occur, and these are quickly eliminated by learning. The predictive density thus moves through a sequence of factorial states.

Figure 3: The decay of mutual information during learning: model with 4 possible states $h_l$, representing Gaussian likelihoods $p(y_i\vert x_i,h_l)$ with means $\pm 1$ for two different $x_i$ values. Shown are the posterior probabilities $p(h_l\vert f)$ ($a$, $c$, $e$, on the left-hand side; the posterior of the true $h_l$ is shown by a thick line) and the mutual information $M(y)$ ($b$, $d$, $f$, on the right-hand side) during learning of 50 training data. ($a$, $b$): The mutual information decays during learning and quickly becomes practically zero. ($c$, $d$): For ``unlucky'' training data, a wrong hypothesis $h_l$ can dominate at the beginning. Nevertheless, the mutual information decays, and the correct hypothesis must finally be found through ``local'' learning. ($e$, $f$): Starting with a factorial prior, the mutual information is and remains zero, up to artificial numerical fluctuations. For ($e$, $f$) the same random data have been used as for ($c$, $d$).

