

Massive relaxation

We now consider methods to construct a positive definite or at least invertible learning matrix. For example, far from a minimum the Hessian ${\bf H}$ may not be positive definite and, like a differential operator ${{\bf K}}$ with zero modes, may not even be invertible. Massive relaxation can transform a non-invertible or non-positive-definite operator ${\bf A}_0$, e.g. ${\bf A}_0 = {{\bf K}}$ or ${\bf A}_0 = - {\bf H}$, into an invertible or positive definite operator:

\begin{displaymath}
{\bf A} = {\bf A}_0 + m^2 {\bf I}
\qquad\qquad {\rm Massive\ relaxation}
\end{displaymath} (638)

A generalization would be to allow $m = m(x,y)$. This is used, for example, in some realizations of Newton's method for minimization in regions where ${\bf H}$ is not positive definite: a diagonal operator is added to $-{\bf H}$, using for instance a modified Cholesky factorization [19]. The mass term removes the zero modes of ${{\bf K}}$ if $-m^2$ is not in the spectrum of ${\bf A}_0$, and it makes ${\bf A}$ positive definite if $m^2$ is larger than minus the smallest eigenvalue of ${\bf A}_0$. Matrix elements $(\, \phi,\, ({\bf A}_0 - z {\bf I})^{-1} \,\phi\,)$ of the resolvent ${\bf A}^{-1} (z)$, where $z=-m^2$ is treated as a complex variable, have poles at discrete eigenvalues of ${\bf A}_0$ and a cut at the continuous spectrum, as long as $\phi $ has a non-zero overlap with the corresponding eigenfunctions. Instead of multiples of the identity, other operators may also be added to remove zero modes. The Hessian ${\bf H}_L$ in (156), for example, adds an $x$-dependent mass-like, but not necessarily positive definite, term to ${{\bf K}}$. Similarly, ${\bf H}_P$ in (182) has an $(x,y)$-dependent mass ${\bf P}^{-2} {\bf N}$ restricted to data points.
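
As a minimal numerical illustration (added here, not part of the original text), the following Python sketch applies the shift ${\bf A} = {\bf A}_0 + m^2 {\bf I}$ to a symmetric matrix that is neither positive definite nor invertible; the matrix A0 and the safety margin are invented for illustration only.

import numpy as np

# Hypothetical example of an "A_0" (e.g. -H or K) with a negative and a
# zero eigenvalue, hence neither positive definite nor invertible.
A0 = np.array([[ 1.0, 0.0, 0.0],
               [ 0.0, 0.0, 0.0],
               [ 0.0, 0.0,-0.5]])

lam_min = np.linalg.eigvalsh(A0)[0]        # smallest eigenvalue of A_0
m2 = max(0.0, -lam_min) + 1e-3             # m^2 > -lambda_min gives positive definiteness
A = A0 + m2 * np.eye(A0.shape[0])          # massive relaxation: A = A_0 + m^2 I

print(np.linalg.eigvalsh(A))               # all eigenvalues now positive
print(np.linalg.solve(A, np.ones(3)))      # A is invertible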

While full relaxation is the massless limit $m^2\rightarrow 0$ of massive relaxation, a gradient algorithm with learning rate $\eta^\prime $ is obtained in the infinite-mass limit $m^2\rightarrow \infty$ with $\eta \rightarrow \infty$ and $m^2/\eta = 1/\eta^\prime$.
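
For concreteness, a worked step added here (assuming, as in the preceding learning-matrix sections, an update step proportional to $\eta\, {\bf A}^{-1} g$ for gradient $g$): the two limits read

\begin{displaymath}
\eta\, ({\bf A}_0 + m^2 {\bf I})^{-1} g
\;\longrightarrow\; \eta\, {\bf A}_0^{-1} g
\quad (m^2\rightarrow 0),
\qquad
\eta\, ({\bf A}_0 + m^2 {\bf I})^{-1} g
\;\longrightarrow\; \frac{\eta}{m^2}\, g = \eta^\prime g
\quad (m^2,\eta\rightarrow\infty,\; m^2/\eta = 1/\eta^\prime).
\end{displaymath}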

Constant functions are typical zero modes, i.e., eigenfunctions with eigenvalue zero, of differential operators with periodic boundary conditions. For instance, for a common smoothness term $-\Delta$ (the kinetic energy operator) as regularizing operator ${{\bf K}}$, the inverse of ${\bf A} = {{\bf K}} +m^2 {\bf I}$ has the form

\begin{displaymath}
{\bf A}^{-1} (x^\prime , y^\prime ; x,y ) =
\, \, \frac{1}{-\Delta + m^2}.
\end{displaymath} (639)


\begin{displaymath}
=\int_{-\infty}^\infty \! \frac{d^{d_X}\!k_x \, d^{d_Y}\!k_y}{(2\pi)^d} \,
\frac{e^{i k_x (x-x^\prime) + ik_y (y-y^\prime )}}{k_x^2 + k_y^2+m^2},
\end{displaymath} (640)

with $d=d_X+d_Y$, $d_X$ = dim($X$), $d_Y$ = dim($Y$). This Green's function, or matrix element of the resolvent kernel for ${\bf A}_0$, is analogous to the (Euclidean) propagator of a free scalar field with mass $m$, which is its two-point correlation function or matrix element of the covariance operator. According to $1/x = \int_0^\infty \! dt\, e^{-xt}$ the denominator can be brought into the exponent by introducing an additional integral. Performing the resulting Gaussian integration over $k = (k_x,k_y)$, the inverse can be expressed as

\begin{displaymath}
{\bf A}^{-1} (x^\prime , y^\prime ; x,y ; m)
=
m^{d-2} {\bf A}^{-1} (m (x-x^\prime) , m (y-y^\prime) ; 1)
\end{displaymath}


\begin{displaymath}
=(2 \pi)^{-d/2}
\left( \frac{m}{\vert x-x^\prime\vert+\vert y-y^\prime\vert} \right)^{(d-2)/2}
K_{(d-2)/2}( m \vert x-x^\prime\vert+ m \vert y-y^\prime\vert ),
\end{displaymath} (641)

in terms of the modified Bessel functions $K_\nu (x)$, which have the integral representation
\begin{displaymath}
K_\nu (2 \sqrt{\beta \gamma}) =
\frac{1}{2} \left(\frac{\gamma}{\beta}\right)^{\nu/2}
\int_0^\infty \! dt\, t^{\nu-1} e^{-\frac{\beta}{t}-\gamma t}.
\end{displaymath} (642)

Alternatively, the same result can be obtained by switching to $d$-dimensional spherical coordinates, expanding the exponential in ultraspherical harmonics, and performing the integration over the angular variables [117]. For $d=1$ this corresponds to Parzen's kernel used in density estimation; for $d=3$ one obtains
\begin{displaymath}
{\bf A}^{-1} (x^\prime , y^\prime ; x,y )
= \frac{1}{4 \pi \left( \vert x-x^\prime\vert + \vert y-y^\prime\vert \right)}
\, e^{- m \vert x-x^\prime\vert - m \vert y-y^\prime\vert }.
\end{displaymath} (643)
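
As a consistency check (added here), writing $r = \vert x-x^\prime\vert + \vert y-y^\prime\vert$ as it appears in (641), the case $d=3$ follows from $K_{1/2}(z) = \sqrt{\pi/(2z)}\, e^{-z}$:

\begin{displaymath}
(2\pi)^{-3/2} \left(\frac{m}{r}\right)^{1/2} K_{1/2}(m r)
= (2\pi)^{-3/2} \left(\frac{m}{r}\right)^{1/2} \sqrt{\frac{\pi}{2 m r}}\; e^{-m r}
= \frac{e^{-m r}}{4 \pi r} ,
\end{displaymath}

which is (643).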

The Green's function for periodic, Neumann, or Dirichlet boundary conditions can be expressed as sums over ${\bf A}^{-1} (x^\prime , y^\prime ; x,y )$ [77].
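
A minimal sketch of this image-sum construction (added here; the standard method of images, assuming period $L$ in each direction): for periodic boundary conditions

\begin{displaymath}
{\bf A}^{-1}_{\rm per} (x^\prime , y^\prime ; x,y )
= \sum_{n_x \in {\bf Z}^{d_X}} \sum_{n_y \in {\bf Z}^{d_Y}}
{\bf A}^{-1} (x^\prime , y^\prime ; x + n_x L , y + n_y L ),
\end{displaymath}

while Dirichlet or Neumann conditions are obtained analogously by adding reflected images with alternating or equal signs, respectively.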

The lattice version of the Laplacian with lattice spacing $a$ reads

\begin{displaymath}
\hat \Delta f(n) =
\frac{1}{a^2} \sum_j^d
[ f(n-a_j)-2 f(n) + f(n+a_j) ],
\end{displaymath} (644)

where $a_j$ denotes a vector in direction $j$ of length $a$. Including a mass term, we obtain as lattice approximation of ${\bf A}$

\begin{displaymath}
\hat{\bf A} (n_x,n_y ; m_x,m_y ) =
-\frac{1}{a^2}\sum_{i=1}^{d_X}
\delta_{n_y,m_y}
(\delta_{n_x+a_i^x,m_x}
-2 \delta_{n_x,m_x}
+ \delta_{n_x-a^x_i,m_x})
\end{displaymath}


\begin{displaymath}
-\frac{1}{a^2}\sum_{j=1}^{d_Y}
\delta_{n_x,m_x}
(\delta_{n_y+a^y_j,m_y}
-2 \delta_{n_y,m_y}
+ \delta_{n_y-a^y_j,m_y})
+m^2 \delta_{n_x,m_x} \delta_{n_y,m_y} .
\end{displaymath} (645)
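
The following Python sketch (added for illustration; a one-dimensional periodic lattice with arbitrary size, spacing, and mass term) builds $\hat{\bf A} = -\hat\Delta + m^2{\bf I}$ from (644)-(645) and shows that the constant zero mode of $-\hat\Delta$ is removed for $m^2>0$:

import numpy as np

N, a, m2 = 64, 1.0, 0.1                               # illustrative lattice size, spacing, m^2
minus_laplacian = (2*np.eye(N)
                   - np.roll(np.eye(N), 1, axis=1)
                   - np.roll(np.eye(N), -1, axis=1)) / a**2   # periodic -Delta-hat, eq. (644)
A_hat = minus_laplacian + m2 * np.eye(N)              # lattice A = -Delta-hat + m^2 I

const = np.ones(N) / np.sqrt(N)                       # constant function on the lattice
print(np.allclose(minus_laplacian @ const, 0.0))      # True: zero mode of -Delta-hat
print(np.linalg.eigvalsh(A_hat)[0])                   # smallest eigenvalue equals m^2 > 0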

Inserting the Fourier representation (101) of $\delta (x)$ gives

\begin{displaymath}
\hat{\bf A} (n_x,n_y ; m_x,m_y )
= \frac{2d}{a^2}
\int_{-\pi}^\pi \!\!
\frac{d^{d_X}\!k_x \, d^{d_Y}\!k_y}{(2\pi)^d}
\, e^{ik_x (n_x-m_x) + ik_y (n_y-m_y)}
\end{displaymath}


\begin{displaymath}
\times \left(
1 + \frac{m^2 a^2}{2d}
-\frac{1}{d} \sum_{i=1}^{d_X} \cos k_{x,i}
-\frac{1}{d} \sum_{j=1}^{d_Y} \cos k_{y,j}
\right),
\end{displaymath} (646)

with $k_{x,i}$ = $k_x a^x_i $, $k_{y,j}$ = $k_y a^y_j$, and inverse

\begin{displaymath}
\hat {\bf A}^{-1} (n_x , n_y ; m_x,m_y) =
\int_{-\pi}^\pi \!\!
\frac{d^{d_X}\!k_x \, d^{d_Y}\!k_y}{(2\pi)^d}
\, \hat {\bf A}^{-1} (k_x , k_y)
\, e^{ik_x (n_x-m_x) + ik_y (n_y-m_y)}
\end{displaymath}


\begin{displaymath}
=\frac{a^2}{2d}
\int_{-\pi}^\pi \!\!
\frac{d^{d_X}\!k_x \, d^{d_Y}\!k_y}{(2\pi)^d}
\,
\frac{e^{ik_x (n_x-m_x) + ik_y (n_y-m_y)}}{
1 + \frac{m^2 a^2}{2d}
\!-\!
\frac{1}{d} \sum_{i=1}^{d_X} \cos k_{x,i}
\!-\!
\frac{1}{d} \sum_{j=1}^{d_Y} \cos k_{y,j}
}.
\end{displaymath} (647)

For $m=0$ and $d\le 2$ the integrand diverges for $k\rightarrow 0$ (infrared divergence). Formally subtracting the likewise infinite $\hat {\bf A}^{-1} (0,0 ; 0,0)$ results in a finite difference; for example, in $d=1$ one finds $\hat {\bf A}^{-1} (n_y ; m_y) - \hat {\bf A}^{-1} (0;0)$ = $-(1/2) \vert n_y-m_y \vert$ [103]. Using $1/x = \int_0^\infty \! dt\, e^{-xt}$ one obtains [195]
\begin{displaymath}
\hat {\bf A}^{-1} (k_x , k_y) =
\frac{1}{2} \int_0^\infty \! dt\,
e^{ -t \mu
+ \frac{t}{a^2} \left( \sum_i^{d_X} \cos k_{x,i} + \sum_j^{d_Y} \cos k_{y,j} \right) },
\end{displaymath} (648)

with $\mu = d/a^2 + m^2/2$. This allows one to express the inverse $\hat {\bf A}^{-1}$ in terms of the modified Bessel functions $I_\nu (n)$, which have, for integer order $\nu$, the integral representation
\begin{displaymath}
I_\nu (n) =
\frac{1}{\pi} \int_0^\pi \!d\Theta\, e^{n \cos \Theta}\cos (\nu \Theta).
\end{displaymath} (649)
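
Combined with (648), each of the $d$ one-dimensional $k$-integrals in (647) then factorizes (a one-line intermediate step added here; the odd sine part of the exponential drops out by symmetry):

\begin{displaymath}
\int_{-\pi}^{\pi} \frac{dk}{2\pi}\,
e^{ik(n-m)}\, e^{\frac{t}{a^2}\cos k}
= \frac{1}{\pi}\int_0^\pi \!d\Theta\,
e^{\frac{t}{a^2}\cos \Theta}\cos \left( (n-m)\,\Theta \right)
= I_{\vert n-m\vert} (t/a^2).
\end{displaymath}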

One finds
\begin{displaymath}
\hat {\bf A}^{-1} (n_x , n_y ; m_x,m_y) =
\frac{1}{2} \int_0^\infty \! dt\, e^{-t \mu}
\prod_{i =1}^{d_X} I_{\vert n_{x,i} - m_{x,i}\vert} (t/a^2)
\prod_{j =1}^{d_Y} I_{\vert n_{y,j} - m_{y,j}\vert} (t/a^2).
\end{displaymath} (650)
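
A numerical cross-check of (647) against (650) in the one-dimensional case (added here; lattice spacing, mass term, and the test separation are arbitrary choices): both the Brillouin-zone Fourier integral and the Bessel-function representation give the same infinite-lattice matrix element.

import numpy as np
from scipy.integrate import quad
from scipy.special import ive                  # exponentially scaled I_nu

a, m2, n = 1.0, 0.5, 3                         # spacing, m^2, separation n - m
mu = 1.0/a**2 + m2/2.0                         # mu = d/a^2 + m^2/2 for d = 1

# Eq. (647): inverse via the Fourier integral over the Brillouin zone
fourier = quad(lambda k: (a**2/2.0) * np.cos(k*n)
               / (1.0 + m2*a**2/2.0 - np.cos(k)) / (2.0*np.pi),
               -np.pi, np.pi)[0]

# Eq. (650): inverse via the modified Bessel function I_|n|,
# using I_nu(x) = ive(nu, x) * exp(x) to avoid overflow
bessel = quad(lambda t: 0.5 * ive(abs(n), t/a**2) * np.exp(t/a**2 - mu*t),
              0.0, np.inf)[0]

print(fourier, bessel)                         # the two values agree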

It might be interesting to remark that the matrix elements of the inverse learning matrix, or free massive propagator on the lattice, $\hat {\bf A}^{-1}(x^\prime,y^\prime;x,y)$, can be given an interpretation in terms of (random) walks connecting the two points $(x^\prime, y^\prime)$ and $(x,y)$ [56,195]. For that purpose the lattice Laplacian is split into a diagonal and a nearest-neighbor part

\begin{displaymath}
- \hat \Delta = \frac{1}{a^2} \left( 2d {\bf I} - {\bf W} \right),
\end{displaymath} (651)

where the nearest-neighbor matrix ${\bf W}$ has matrix elements equal to one for nearest neighbors and zero otherwise. Thus,
\begin{displaymath}
\left( - \hat \Delta + m^2\right)^{-1}
=
\frac{1}{2 \mu} \left( {\bf I} - \frac{1}{2 \mu a^2} {\bf W} \right)^{-1}
=
\frac{1}{2 \mu} \sum_{n=0}^\infty
\left(\frac{1}{2 \mu a^2}\right)^n {\bf W}^n ,
\end{displaymath} (652)

can be written as a geometric series. The matrix elements ${\bf W}^n (x^\prime,y^\prime;x,y)$ give the number of walks $w[(x^\prime,y^\prime)\rightarrow (x,y)]$ of length $\vert w\vert$ = $n$ connecting the two points $(x^\prime, y^\prime)$ and $(x,y)$. Thus, one can write
\begin{displaymath}
\left( - \hat \Delta + m^2\right)^{-1}(x^\prime,y^\prime;x,y)
=
\frac{1}{2 \mu} \sum_{w[(x^\prime,y^\prime)\rightarrow (x,y)]}
\left(\frac{1}{2 \mu a^2}\right)^{\vert w\vert} .
\end{displaymath} (653)
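
The geometric series (652) can also be checked numerically (an illustration added here; lattice size, mass term, and truncation order are arbitrary): for a small one-dimensional periodic lattice the truncated sum over powers of ${\bf W}$ converges to the direct inverse.

import numpy as np

N, a, m2 = 16, 1.0, 1.0
mu = 1.0/a**2 + m2/2.0                               # mu = d/a^2 + m^2/2, d = 1
W = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)   # nearest-neighbor matrix

direct = np.linalg.inv(2*mu*np.eye(N) - W/a**2)      # (-Delta-hat + m^2)^{-1}
series = sum((1.0/(2*mu*a**2))**n * np.linalg.matrix_power(W, n)
             for n in range(60)) / (2*mu)            # truncated geometric series, eq. (652)

print(np.max(np.abs(direct - series)))               # small truncation error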

