

Massive relaxation

We now consider methods to construct a positive definite or at least invertible learning matrix. For example, far from a minimum the Hessian ${\bf H}$ may not be positive definite and, like a differential operator ${{\bf K}}$ with zero modes, may not even be invertible. Massive relaxation can transform a non-invertible or non-positive-definite operator ${\bf A}_0$, e.g. ${\bf A}_0 = {{\bf K}}$ or ${\bf A}_0 = - {\bf H}$, into an invertible or positive definite operator:

\begin{displaymath}
{\bf A} = {\bf A}_0 + m^2 {\bf I}
\qquad\qquad {\rm Massive\ relaxation}
\end{displaymath} (638)

A generalization would be to allow $m = m(x,y)$. This is used, for example, in some realizations of Newton's method for minimization in regions where ${\bf H}$ is not positive definite: a diagonal operator is added to $-{\bf H}$, using for instance a modified Cholesky factorization [19]. The mass term removes the zero modes of ${{\bf K}}$ if $-m^2$ is not in the spectrum of ${\bf A}_0$, and it makes ${\bf A}$ positive definite if $m^2$ is larger than minus the smallest eigenvalue of ${\bf A}_0$. Matrix elements $(\, \phi,\, ({\bf A}_0 - z {\bf I})^{-1} \,\phi\,)$ of the resolvent ${\bf A}^{-1} (z)$, where $z=-m^2$ is treated as a complex variable, have poles at discrete eigenvalues of ${\bf A}_0$ and a cut at the continuous spectrum, as long as $\phi $ has a non-zero overlap with the corresponding eigenfunctions. Instead of multiples of the identity, other operators may also be added to remove zero modes. The Hessian ${\bf H}_L$ in (156), for example, adds an $x$-dependent mass-like, but not necessarily positive definite, term to ${{\bf K}}$. Similarly, ${\bf H}_P$ in (182) has an $(x,y)$-dependent mass ${\bf P}^{-2} {\bf N}$ restricted to data points.
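
As a minimal numerical illustration (added here, not part of the original text), the following Python sketch applies the shift ${\bf A} = {\bf A}_0 + m^2 {\bf I}$ to a symmetric matrix that is neither positive definite nor invertible; the matrix A0 and the safety margin are invented for illustration only.

import numpy as np

# Hypothetical example of an "A_0" (e.g. -H or K) with a negative and a
# zero eigenvalue, hence neither positive definite nor invertible.
A0 = np.array([[ 1.0, 0.0, 0.0],
               [ 0.0, 0.0, 0.0],
               [ 0.0, 0.0,-0.5]])

lam_min = np.linalg.eigvalsh(A0)[0]        # smallest eigenvalue of A_0
m2 = max(0.0, -lam_min) + 1e-3             # m^2 > -lambda_min gives positive definiteness
A = A0 + m2 * np.eye(A0.shape[0])          # massive relaxation: A = A_0 + m^2 I

print(np.linalg.eigvalsh(A))               # all eigenvalues now positive
print(np.linalg.solve(A, np.ones(3)))      # A is invertible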

While full relaxation is the massless limit $m^2\rightarrow 0$ of massive relaxation, a gradient algorithm with learning rate $\eta^\prime $ is obtained in the infinite-mass limit $m^2\rightarrow \infty$ with $\eta \rightarrow \infty$ and $m^2/\eta = 1/\eta^\prime$.
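
For concreteness, a worked step added here (assuming, as in the preceding learning-matrix sections, an update step proportional to $\eta\, {\bf A}^{-1} g$ for gradient $g$): the two limits read

\begin{displaymath}
\eta\, ({\bf A}_0 + m^2 {\bf I})^{-1} g
\;\longrightarrow\; \eta\, {\bf A}_0^{-1} g
\quad (m^2\rightarrow 0),
\qquad
\eta\, ({\bf A}_0 + m^2 {\bf I})^{-1} g
\;\longrightarrow\; \frac{\eta}{m^2}\, g = \eta^\prime g
\quad (m^2,\eta\rightarrow\infty,\; m^2/\eta = 1/\eta^\prime).
\end{displaymath}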

Constant functions are typical zero modes, i.e., eigenfunctions with eigenvalue zero, of differential operators with periodic boundary conditions. For instance, for a common smoothness term $-\Delta$ (the kinetic energy operator) as regularizing operator ${{\bf K}}$, the inverse of ${\bf A} = {{\bf K}} +m^2 {\bf I}$ has the form

\begin{displaymath}
{\bf A}^{-1} (x^\prime , y^\prime ; x,y ) =
\, \, \frac{1}{-\Delta + m^2}.
\end{displaymath} (639)


\begin{displaymath}
=\int_{-\infty}^\infty \! \frac{d^{d_X}\!k_x \, d^{d_Y}\!k_y}{(2\pi)^d} \,
\frac{e^{i k_x (x-x^\prime) + ik_y (y-y^\prime )}}{k_x^2 + k_y^2+m^2},
\end{displaymath} (640)

with $d=d_X+d_Y$, $d_X$ = dim($X$), $d_Y$ = dim($Y$). This Green's function, or matrix element of the resolvent kernel for ${\bf A}_0$, is analogous to the (Euclidean) propagator of a free scalar field with mass $m$, which is its two-point correlation function or matrix element of the covariance operator. According to $1/x = \int_0^\infty \! dt\, e^{-xt}$ the denominator can be brought into the exponent by introducing an additional integral. Performing the resulting Gaussian integration over $k = (k_x,k_y)$, the inverse can be expressed as

\begin{displaymath}
{\bf A}^{-1} (x^\prime , y^\prime ; x,y ; m)
=
m^{d-2} {\bf A}^{-1} (m (x-x^\prime) , m (y-y^\prime) ; 1)
\end{displaymath}


\begin{displaymath}
=(2 \pi)^{-d/2}
\left( \frac{m}{\vert x-x^\prime\vert+\vert y-y^\prime\vert} \right)^{(d-2)/2}
K_{(d-2)/2}( m \vert x-x^\prime\vert+ m \vert y-y^\prime\vert ),
\end{displaymath} (641)

in terms of the modified Bessel functions $K_\nu (x)$, which have the integral representation
\begin{displaymath}
K_\nu (2 \sqrt{\beta \gamma}) =
\frac{1}{2} \left(\frac{\gamma}{\beta}\right)^{\nu/2}
\int_0^\infty \! dt\, t^{\nu-1} e^{-\frac{\beta}{t}-\gamma t}.
\end{displaymath} (642)

Alternatively, the same result can be obtained by switching to $d$-dimensional spherical coordinates, expanding the exponential in ultraspherical harmonics, and performing the integration over the angular variables [117]. For $d=1$ this corresponds to Parzen's kernel used in density estimation; for $d=3$ one obtains
\begin{displaymath}
{\bf A}^{-1} (x^\prime , y^\prime ; x,y )
= \frac{1}{4 \pi \left( \vert x-x^\prime\vert + \vert y-y^\prime\vert \right)}
\, e^{- m \vert x-x^\prime\vert - m \vert y-y^\prime\vert }.
\end{displaymath} (643)
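
As a consistency check (added here), writing $r = \vert x-x^\prime\vert + \vert y-y^\prime\vert$ as it appears in (641), the case $d=3$ follows from $K_{1/2}(z) = \sqrt{\pi/(2z)}\, e^{-z}$:

\begin{displaymath}
(2\pi)^{-3/2} \left(\frac{m}{r}\right)^{1/2} K_{1/2}(m r)
= (2\pi)^{-3/2} \left(\frac{m}{r}\right)^{1/2} \sqrt{\frac{\pi}{2 m r}}\; e^{-m r}
= \frac{e^{-m r}}{4 \pi r} ,
\end{displaymath}

which is (643).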

The Green's function for periodic, Neumann, or Dirichlet boundary conditions can be expressed as sums over ${\bf A}^{-1} (x^\prime , y^\prime ; x,y )$ [77].
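
A minimal sketch of this image-sum construction (added here; the standard method of images, assuming period $L$ in each direction): for periodic boundary conditions

\begin{displaymath}
{\bf A}^{-1}_{\rm per} (x^\prime , y^\prime ; x,y )
= \sum_{n_x \in {\bf Z}^{d_X}} \sum_{n_y \in {\bf Z}^{d_Y}}
{\bf A}^{-1} (x^\prime , y^\prime ; x + n_x L , y + n_y L ),
\end{displaymath}

while Dirichlet or Neumann conditions are obtained analogously by adding reflected images with alternating or equal signs, respectively.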

The lattice version of the Laplacian with lattice spacing $a$ reads

\begin{displaymath}
\hat \Delta f(n) =
\frac{1}{a^2} \sum_j^d
[ f(n-a_j)-2 f(n) + f(n+a_j) ],
\end{displaymath} (644)

where $a_j$ denotes a vector in direction $j$ of length $a$. Including a mass term, we obtain as lattice approximation of ${\bf A}$

\begin{displaymath}
\hat{\bf A} (n_x,n_y ; m_x,m_y ) =
-\frac{1}{a^2}\sum_{i=1}^{d_X}
\delta_{n_y,m_y}
(\delta_{n_x+a_i^x,m_x}
-2 \delta_{n_x,m_x}
+ \delta_{n_x-a^x_i,m_x})
\end{displaymath}


\begin{displaymath}
-\frac{1}{a^2}\sum_{j=1}^{d_Y}
\delta_{n_x,m_x}
(\delta_{n_y+a^y_j,m_y}
-2 \delta_{n_y,m_y}
+ \delta_{n_y-a^y_j,m_y})
+m^2 \delta_{n_x,m_x} \delta_{n_y,m_y} .
\end{displaymath} (645)
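
The following Python sketch (added for illustration; a one-dimensional periodic lattice with arbitrary size, spacing, and mass term) builds $\hat{\bf A} = -\hat\Delta + m^2{\bf I}$ from (644)-(645) and shows that the constant zero mode of $-\hat\Delta$ is removed for $m^2>0$:

import numpy as np

N, a, m2 = 64, 1.0, 0.1                               # illustrative lattice size, spacing, m^2
minus_laplacian = (2*np.eye(N)
                   - np.roll(np.eye(N), 1, axis=1)
                   - np.roll(np.eye(N), -1, axis=1)) / a**2   # periodic -Delta-hat, eq. (644)
A_hat = minus_laplacian + m2 * np.eye(N)              # lattice A = -Delta-hat + m^2 I

const = np.ones(N) / np.sqrt(N)                       # constant function on the lattice
print(np.allclose(minus_laplacian @ const, 0.0))      # True: zero mode of -Delta-hat
print(np.linalg.eigvalsh(A_hat)[0])                   # smallest eigenvalue equals m^2 > 0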

Inserting the Fourier representation (101) of $\delta (x)$ gives

\begin{displaymath}
\hat{\bf A} (n_x,n_y ; m_x,m_y )
= \frac{2d}{a^2}
\int_{-\pi}^\pi \!\!
\frac{d^{d_X}\!k_x \, d^{d_Y}\!k_y}{(2\pi)^d}
\, e^{ik_x (n_x-m_x) + ik_y (n_y-m_y)}
\end{displaymath}


\begin{displaymath}
\times \left(
1 + \frac{m^2 a^2}{2d}
-\frac{1}{d} \sum_{i=1}^{d_X} \cos k_{x,i}
-\frac{1}{d} \sum_{j=1}^{d_Y} \cos k_{y,j}
\right),
\end{displaymath} (646)

with $k_{x,i}$ = $k_x a^x_i $, $k_{y,j}$ = $k_y a^y_j$, and inverse

\begin{displaymath}
\hat {\bf A}^{-1} (n_x , n_y ; m_x,m_y) =
\int_{-\pi}^\pi \!\!
\frac{d^{d_X}\!k_x \, d^{d_Y}\!k_y}{(2\pi)^d}
\, \hat {\bf A}^{-1} (k_x , k_y)
\, e^{ik_x (n_x-m_x) + ik_y (n_y-m_y)}
\end{displaymath}


\begin{displaymath}
=\frac{a^2}{2d}
\int_{-\pi}^\pi \!\!
\frac{d^{d_X}\!k_x \, d^{d_Y}\!k_y}{(2\pi)^d}
\,
\frac{e^{ik_x (n_x-m_x) + ik_y (n_y-m_y)}}{
1 + \frac{m^2 a^2}{2d}
\!-\!
\frac{1}{d} \sum_{i=1}^{d_X} \cos k_{x,i}
\!-\!
\frac{1}{d} \sum_{j=1}^{d_Y} \cos k_{y,j}
}.
\end{displaymath} (647)

For $m=0$ and $d\le 2$ the integrand diverges for $k\rightarrow 0$ (infrared divergence). Formally subtracting the likewise infinite $\hat {\bf A}^{-1} (0,0 ; 0,0)$ results in a finite difference; for example, in $d=1$ one finds $\hat {\bf A}^{-1} (n_y ; m_y) - \hat {\bf A}^{-1} (0;0)$ = $-(1/2) \vert n_y-m_y \vert$ [103]. Using $1/x = \int_0^\infty \! dt\, e^{-xt}$ one obtains [195]
\begin{displaymath}
\hat {\bf A}^{-1} (k_x , k_y) =
\frac{1}{2} \int_0^\infty \! dt\,
e^{ -t \mu
+ \frac{t}{a^2} \left( \sum_i^{d_X} \cos k_{x,i} + \sum_j^{d_Y} \cos k_{y,j} \right) },
\end{displaymath} (648)

with $\mu = d/a^2 + m^2/2$. This allows one to express the inverse $\hat {\bf A}^{-1}$ in terms of the modified Bessel functions $I_\nu (n)$, which have, for integer order $\nu$, the integral representation
\begin{displaymath}
I_\nu (n) =
\frac{1}{\pi} \int_0^\pi \!d\Theta\, e^{n \cos \Theta}\cos (\nu \Theta).
\end{displaymath} (649)
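
Combined with (648), each of the $d$ one-dimensional $k$-integrals in (647) then factorizes (a one-line intermediate step added here; the odd sine part of the exponential drops out by symmetry):

\begin{displaymath}
\int_{-\pi}^{\pi} \frac{dk}{2\pi}\,
e^{ik(n-m)}\, e^{\frac{t}{a^2}\cos k}
= \frac{1}{\pi}\int_0^\pi \!d\Theta\,
e^{\frac{t}{a^2}\cos \Theta}\cos \left( (n-m)\,\Theta \right)
= I_{\vert n-m\vert} (t/a^2).
\end{displaymath}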

One finds
\begin{displaymath}
\hat {\bf A}^{-1} (n_x , n_y ; m_x,m_y) =
\frac{1}{2} \int_0^\infty \! dt\, e^{-t \mu}
\prod_{i =1}^{d_X} I_{\vert n_{x,i} - m_{x,i}\vert} (t/a^2)
\prod_{j =1}^{d_Y} I_{\vert n_{y,j} - m_{y,j}\vert} (t/a^2).
\end{displaymath} (650)
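
A numerical cross-check of (647) against (650) in the one-dimensional case (added here; lattice spacing, mass term, and the test separation are arbitrary choices): both the Brillouin-zone Fourier integral and the Bessel-function representation give the same infinite-lattice matrix element.

import numpy as np
from scipy.integrate import quad
from scipy.special import ive                  # exponentially scaled I_nu

a, m2, n = 1.0, 0.5, 3                         # spacing, m^2, separation n - m
mu = 1.0/a**2 + m2/2.0                         # mu = d/a^2 + m^2/2 for d = 1

# Eq. (647): inverse via the Fourier integral over the Brillouin zone
fourier = quad(lambda k: (a**2/2.0) * np.cos(k*n)
               / (1.0 + m2*a**2/2.0 - np.cos(k)) / (2.0*np.pi),
               -np.pi, np.pi)[0]

# Eq. (650): inverse via the modified Bessel function I_|n|,
# using I_nu(x) = ive(nu, x) * exp(x) to avoid overflow
bessel = quad(lambda t: 0.5 * ive(abs(n), t/a**2) * np.exp(t/a**2 - mu*t),
              0.0, np.inf)[0]

print(fourier, bessel)                         # the two values agree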

It might be interesting to remark that the matrix elements of the inverse learning matrix, or free massive propagator on the lattice, $\hat {\bf A}^{-1}(x^\prime,y^\prime;x,y)$, can be given an interpretation in terms of (random) walks connecting the two points $(x^\prime, y^\prime)$ and $(x,y)$ [56,195]. For that purpose the lattice Laplacian is split into a diagonal and a nearest-neighbor part

\begin{displaymath}
- \hat \Delta = \frac{1}{a^2} \left( 2d {\bf I} - {\bf W} \right),
\end{displaymath} (651)

where the nearest-neighbor matrix ${\bf W}$ has matrix elements equal to one for nearest neighbors and zero otherwise. Thus,
\begin{displaymath}
\left( - \hat \Delta + m^2\right)^{-1}
=
\frac{1}{2 \mu} \left( {\bf I} - \frac{1}{2 \mu a^2} {\bf W} \right)^{-1}
=
\frac{1}{2 \mu} \sum_{n=0}^\infty
\left(\frac{1}{2 \mu a^2}\right)^n {\bf W}^n ,
\end{displaymath} (652)

can be written as a geometric series. The matrix elements ${\bf W}^n (x^\prime,y^\prime;x,y)$ give the number of walks $w[(x^\prime,y^\prime)\rightarrow (x,y)]$ of length $\vert w\vert$ = $n$ connecting the two points $(x^\prime, y^\prime)$ and $(x,y)$. Thus, one can write
\begin{displaymath}
\left( - \hat \Delta + m^2\right)^{-1}(x^\prime,y^\prime;x,y)
=
\frac{1}{2 \mu} \sum_{w[(x^\prime,y^\prime)\rightarrow (x,y)]}
\left(\frac{1}{2 \mu a^2}\right)^{\vert w\vert} .
\end{displaymath} (653)
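
The geometric series (652) can also be checked numerically (an illustration added here; lattice size, mass term, and truncation order are arbitrary): for a small one-dimensional periodic lattice the truncated sum over powers of ${\bf W}$ converges to the direct inverse.

import numpy as np

N, a, m2 = 16, 1.0, 1.0
mu = 1.0/a**2 + m2/2.0                               # mu = d/a^2 + m^2/2, d = 1
W = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)   # nearest-neighbor matrix

direct = np.linalg.inv(2*mu*np.eye(N) - W/a**2)      # (-Delta-hat + m^2)^{-1}
series = sum((1.0/(2*mu*a**2))**n * np.linalg.matrix_power(W, n)
             for n in range(60)) / (2*mu)            # truncated geometric series, eq. (652)

print(np.max(np.abs(direct - series)))               # small truncation error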

