
Numerical solution of stationarity equations

Due to the presence of the logarithmic data term $-(\ln P, N)$ and the normalization constraint, the stationarity equations of density estimation problems are in general nonlinear, even for Gaussian specific priors. An exception is given by the Gaussian regression problems discussed in Section 3.7, for which $-(\ln P, N)$ becomes quadratic and the normalization constraint can be skipped. However, the nonlinearities arising from the data term $-(\ln P, N)$ are restricted to a finite number of training data points, and for Gaussian specific priors one may expect them, like those arising from the normalization constraint, not to be numerically very harmful. Clearly, severe nonlinearities can appear for general non-Gaussian specific priors or general nonlinear parameterizations $P(\xi)$.

Being nonlinear, the stationarity conditions in general have to be solved by iteration. In the context of empirical learning, iteration procedures that minimize an error functional represent possible learning algorithms.

In the previous sections we have encountered stationarity equations

\begin{displaymath}
0 = \frac{\delta (-E_\phi)}{\delta \phi} = G(\phi)
,
\end{displaymath} (607)

for error functionals $E_\phi$, e.g., $\phi $ = $L$ or $\phi $ = $P$, written in the form
\begin{displaymath}
{{\bf K}} \phi = T,
\end{displaymath} (608)

with $\phi $-dependent $T$ (and possibly $\phi $-dependent ${\bf K}$). For the stationarity Eqs. (143), (172), and (193) the operator ${{\bf K}}$ is a $\phi $-independent inverse covariance of a Gaussian specific prior. It has already been mentioned that for an existing (and not too ill-conditioned) ${{\bf K}}^{-1}$ (representing the covariance of the prior process), Eq. (608) suggests the iteration scheme
\begin{displaymath}
\phi^{(i+1)} = {{\bf K}}^{-1} T(\phi^{(i)})
,
\end{displaymath} (609)

for a discretized $\phi $, starting from some initial guess $\phi^{(0)}$. In general, as for the non-Gaussian specific priors discussed in Section 6, ${{\bf K}}$ can be $\phi $-dependent. Eq. (368) shows that general nonlinear parameterizations $P(\xi)$ lead to nonlinear operators ${{\bf K}}$.
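
As a minimal numerical sketch (not taken from the text), the scheme (609) for a discretized $\phi $ could be implemented as follows; the tridiagonal toy ${{\bf K}}$ and the nonlinear $T$ chosen here are arbitrary assumptions used only for illustration.

import numpy as np

def fixed_point_iteration(K, T, phi0, n_iter=100, tol=1e-8):
    # Sketch of Eq. (609): phi^(i+1) = K^(-1) T(phi^(i)) for a discretized phi.
    # K: discretized, phi-independent inverse covariance of the Gaussian
    # specific prior; T: callable returning the phi-dependent right-hand side.
    phi = phi0.copy()
    for _ in range(n_iter):
        phi_new = np.linalg.solve(K, T(phi))   # avoids forming K^(-1) explicitly
        if np.max(np.abs(phi_new - phi)) < tol:
            break
        phi = phi_new
    return phi_new

# toy example (assumption): K = lambda*(I - Laplacian) on a one-dimensional grid,
# T a bounded nonlinear function standing in for the data and normalization terms
n, lam = 50, 1.0
laplacian = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
K = lam * (np.eye(n) - laplacian)
phi_star = fixed_point_iteration(K, T=lambda p: np.tanh(p) + 0.1 * np.ones(n),
                                 phi0=np.zeros(n))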

Clearly, if a $\phi $-dependent $T$ is allowed, the form (608) is no restriction of generality. One can always choose an arbitrary invertible (and not too ill-conditioned) ${\bf A}$, define

\begin{displaymath}
T_{\bf A}= G(\phi) + {\bf A} \phi,
\end{displaymath} (610)

write a stationarity equation as
\begin{displaymath}
{\bf A} \phi = T_{\bf A},
\end{displaymath} (611)

discretize and iterate with ${\bf A}^{-1}$. To obtain a numerical iteration scheme we will choose a linear, positive definite learning matrix ${\bf A}$. The learning matrix may depend on $\phi $ and may also change during iteration.

To connect a stationarity equation given in the form (608) to an arbitrary iteration scheme with a learning matrix ${\bf A}$, we define

\begin{displaymath}
{\bf B} = {{\bf K}} - {\bf A},
\quad
{\bf B}_\eta ={{\bf K}} - \frac{1}{\eta}{\bf A},
\end{displaymath} (612)

i.e., we split ${{\bf K}}$ into two parts
\begin{displaymath}
{{\bf K}}={\bf A} + {\bf B} = \frac{1}{\eta}{\bf A} + {\bf B}_\eta,
\end{displaymath} (613)

where we introduced $\eta $ for later convenience. Then we obtain from the stationarity equation (608)
\begin{displaymath}
\phi = \eta {\bf A}^{-1} (T - {\bf B}_\eta \phi ) .
\end{displaymath} (614)

To iterate, we insert an approximate solution $\phi^{(i)}$ into the right-hand side and obtain a new $\phi^{(i+1)}$ by evaluating the left-hand side. This can be written in one of the following equivalent forms
$\displaystyle \phi^{(i+1)}$ $\textstyle =$ $\displaystyle \eta {\bf A}^{-1} \left( T^{(i)} - {\bf B}_\eta \phi^{(i)} \right)$ (615)
  $\textstyle =$ $\displaystyle (1-\eta) \phi^{(i)} + \eta {\bf A}^{-1} \left( T^{(i)} - {\bf B} \phi^{(i)} \right)$ (616)
  $\textstyle =$ $\displaystyle \phi^{(i)} + \eta {\bf A}^{-1} \left( T^{(i)} - {{\bf K}} \phi^{(i)} \right),$ (617)

where $\eta $ plays the role of a learning rate or step width, and ${\bf A}^{-1}$ = $\left({\bf A}^{(i)}\right)^{-1}$ may be iteration dependent. The update equations (615-617) can be written
\begin{displaymath}
\Delta \phi^{(i)} = \eta {\bf A}^{-1} G (\phi^{(i)}),
\end{displaymath} (618)

with $\Delta \phi^{(i)}$ = $\phi^{(i+1)} - \phi^{(i)}$. Eq. (617) does not require the calculation of ${\bf B}$ or ${\bf B}_\eta$, so that ${\bf A}^{-1}$ can be specified directly instead of ${\bf A}$, without the need to compute an inverse. For example, operators that approximate ${{\bf K}}^{-1}$ and are easy to calculate may be good choices for ${\bf A}^{-1}$.
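
A minimal sketch of the update (617), equivalently (618), is given below; the toy ${{\bf K}}$, the constant $T$, and the diagonal choice of ${\bf A}^{-1}$ are assumptions made only for illustration.

import numpy as np

def learning_matrix_step(phi, G, A_inv, eta=0.1):
    # One step of Eq. (618): Delta phi^(i) = eta * A^(-1) G(phi^(i)),
    # with A^(-1) supplied directly so that A itself never has to be inverted.
    return phi + eta * A_inv @ G(phi)

# toy quadratic example (assumption): G(phi) = T - K phi with constant T
n = 20
K = np.eye(n) + 0.5 * np.diag(np.ones(n - 1), 1) + 0.5 * np.diag(np.ones(n - 1), -1)
T = np.ones(n)
G = lambda phi: T - K @ phi

A_inv = np.diag(1.0 / np.diag(K))     # cheap diagonal approximation of K^(-1)
phi = np.zeros(n)
for _ in range(200):
    phi = learning_matrix_step(phi, G, A_inv, eta=0.5)
# for A_inv = K^(-1) and eta = 1 a single step already solves K phi = T exactly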

For a positive definite ${\bf A}$ (and thus also positive definite ${\bf A}^{-1}$) convergence can be guaranteed, at least theoretically. Indeed, multiplying (618) with $(1/\eta ) {\bf A}$ and projecting onto an infinitesimal $d\phi$ gives

\begin{displaymath}
\frac{1}{\eta} (\, d\phi,\, {\bf A}\,\Delta \phi\,) =
\Big(\, d\phi,\, \frac{\delta (-E_\phi)}{\delta \phi}\Bigg\vert _{\phi=\phi^{(i)}}\,\Big)
= d(-E_\phi).
\end{displaymath} (619)

In an infinitesimal neighborhood of $\phi^{(i)}$, where $\Delta \phi^{(i)}$ equals $d\phi$ to first order, the left-hand side is larger than (or equal to) zero for positive (semi-)definite ${\bf A}$. This shows that, at least for $\eta $ small enough, the posterior log-probability $-E_\phi$ increases, i.e., the differential $dE_\phi$ is smaller than or equal to zero and the value of the error functional $E_\phi$ decreases.

The stationarity equation (127) for minimizing $E_L$ yields for (615-617),

$\displaystyle L^{(i+1)}$ $\textstyle =$ $\displaystyle \eta {\bf A}^{-1} \left( N - {\bf\Lambda}_X^{(i)} e^{L^{(i)}} - {{\bf K}} L^{(i)} + \frac{1}{\eta} {\bf A} L^{(i)}\right)$ (620)
  $\textstyle =$ $\displaystyle (1-\eta) L^{(i)} + \eta {\bf A}^{-1} \left( N - {\bf\Lambda}_X^{(i)} e^{L^{(i)}} - {{\bf K}} L^{(i)} +{\bf A} L^{(i)} \right)$ (621)
  $\textstyle =$ $\displaystyle L^{(i)} + \eta {\bf A}^{-1} \left( N - {\bf\Lambda}_X^{(i)} e^{L^{(i)}} - {{\bf K}} L^{(i)} \right).$ (622)

The function $\Lambda_X^{(i)}$ is also unknown and is one of the variables we want to solve for. The normalization conditions provide the necessary additional equations, and the matrix ${\bf A}^{-1}$ can be extended to include the iteration procedure for $\Lambda_X$. For example, we can insert the stationarity equation for $\Lambda_X$ into (622) to get
\begin{displaymath}
L^{(i+1)} =
L^{(i)} + \eta {\bf A}^{-1}
\left[ N - e^{L^{(i)}} \left(N_X-{\bf I}_X {{\bf K}} L^{(i)}\right)
- {{\bf K}} L^{(i)} \right] .
\end{displaymath} (623)

If $L^{(i)}$ is normalized at each iteration, this corresponds to an iteration procedure for $g = L + \ln Z_X$.
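
The following is a minimal sketch of the update (623) with renormalization at each step, written for a single $x$-slice on a discretized $y$-grid so that ${\bf I}_X$ reduces to a sum over $y$; the grid, the treatment of the integration measure $dy$, and the step width are assumptions made for illustration.

import numpy as np

def iterate_L(L, N, K, A_inv, eta=0.05, dy=1.0):
    # One step of Eq. (623) for a single x-slice on a y-grid (assumption:
    # I_X reduces to an integral over y, here a sum times dy).
    # N: vector of data counts on the y-grid, K: discretized inverse covariance.
    KL = K @ L
    N_X = N.sum()                         # N_X = I_X N
    Lambda_X = N_X - dy * KL.sum()        # stationarity equation for Lambda_X
    L = L + eta * A_inv @ (N - np.exp(L) * Lambda_X - KL)
    # renormalize, i.e., effectively iterate g = L + ln Z_X
    return L - np.log(np.sum(np.exp(L)) * dy)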

Similarly, for the functional $E_P$ we have to solve (166) and obtain for (617),

$\displaystyle P^{(i+1)}$ $\textstyle =$ $\displaystyle P^{(i)} + \eta {\bf A}^{-1} \left( T^{(i)}_P-{{\bf K}} P^{(i)} \right)$ (624)
  $\textstyle =$ $\displaystyle P^{(i)} + \eta {\bf A}^{-1} \left( {\bf P^{(i)}}^{-1} N - \Lambda_X^{(i)} -{{\bf K}} P^{(i)} \right)$ (625)
  $\textstyle =$ $\displaystyle P^{(i)} + \eta {\bf A}^{-1} \left( {\bf P^{(i)}}^{-1} N - N_X - {\bf I}_X {\bf P}^{(i)} {{\bf K}} P^{(i)} -{{\bf K}} P^{(i)} \right).$ (626)

Again, if $P$ is normalized at each iteration, this is equivalent to solving for $z = Z_X P$, and the update procedure for $\Lambda_X$ can be varied.
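
A corresponding sketch for the update (626), under the same single-$x$-slice assumptions and with the bold ${\bf P}^{(i)}$ read as elementwise multiplication by $P^{(i)}$, might look as follows.

import numpy as np

def iterate_P(P, N, K, A_inv, eta=0.05, dy=1.0):
    # One step of Eq. (626) for a single x-slice; bold P acts as elementwise
    # multiplication, I_X as a sum over y times dy (assumptions).
    KP = K @ P
    Lambda_X = N.sum() + dy * np.sum(P * KP)    # Lambda_X = N_X + I_X P K P
    P = P + eta * A_inv @ (N / P - Lambda_X - KP)
    P = np.clip(P, 1e-12, None)                 # keep P positive (heuristic)
    return P / (P.sum() * dy)                   # renormalize: iterate z = Z_X P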

