

Maximum likelihood approximation

A maximum likelihood approach selects the potential $v$ with maximal likelihood $p(x_T\vert\hat x,v)$ under the training data. Beginning with the parametric approach, we consider a potential $v(\xi,x)$ parameterized by a parameter vector $\xi$ with components $\xi_l$. To find the parameter vector that maximizes the training likelihood, we have to solve the stationarity equation

\begin{displaymath}
0= \partial_{\xi} p(x_T\vert\hat x,v)
,
\end{displaymath} (21)

with $\partial_{\xi}$ = $\partial/\partial \xi$ denoting the gradient operator with components $\partial_{\xi_l}$. From Eq. (20) we obtain
$\displaystyle \partial_{\xi} p(x_i\vert\hat x,v)$ $\textstyle =$ $\displaystyle <\big(\partial_{\xi} \phi^*(x_i)\big)\phi (x_i)>
+<\phi^*(x_i) \big(\partial_{\xi} \phi (x_i)\big)>$ (22)
  $\textstyle -$ $\displaystyle \beta\left(
<\vert\phi (x_i)\vert^2\partial_{\xi} E>
-<\vert\phi (x_i)\vert^2><\partial_{\xi} E >
\right)
,$  

We see that, in order to solve Eq. (21), we have to calculate the derivatives of the eigenvalues, $\partial_\xi E_\alpha$, and of the eigenfunctions at the data points, $\partial_\xi \phi_\alpha (x_i)$. These are implicitly defined by the eigenvalue equation for $H$ = $H(v)$. To proceed, we take the derivative of the eigenvalue equation (18),
\begin{displaymath}
\left(\partial_\xi H\right) \mbox{$\vert\,\phi_\alpha\!>$}
+ H \mbox{$\vert\,\partial_\xi \phi_\alpha\!>$}
=
\left(\partial_\xi E_\alpha\right) \mbox{$\vert\,\phi_\alpha\!>$}
+
E_\alpha \mbox{$\vert\,\partial_\xi \phi_\alpha\!>$}
.
\end{displaymath} (23)

Projecting onto $\mbox{$<\!\phi_\alpha\,\vert$}$ and using $\partial_\xi H$ = $\partial_\xi v$ together with the Hermitian conjugate of Eq. (18), we arrive at
$\displaystyle \partial_\xi E_\alpha$ $\textstyle =$ $\displaystyle \frac{<\!\phi_\alpha\,\vert\,\partial_\xi v\,\vert\,\phi_\alpha\!>}
{\mbox{$<\!\phi_\alpha\,\vert\,\phi_\alpha\!>$}}
,$ (24)
$\displaystyle (E_\alpha-H)
\mbox{$\vert\,\partial_\xi \phi_\alpha\!>$}$ $\textstyle =$ $\displaystyle \left( \partial_\xi v -\partial_\xi E_\alpha \right)
\mbox{$\vert\,\phi_\alpha \!>$}
.$ (25)

Because all orbitals with energy $E_\alpha$ (of which there may be more than one if $E_\alpha$ is degenerate) lie in the null space of the operator $(E_\alpha-H)$, Eq. (25) alone does not determine $\partial_\xi \phi_\alpha$ uniquely. We also notice that, because the left hand side of Eq. (25) vanishes when projected onto an eigenfunction $\phi_\gamma$ with $E_\gamma$ = $E_\alpha$, we find $<\!\phi_\gamma\,\vert\,\partial_\xi H\,\vert\,\phi_\alpha\!>$ = 0 for degenerate eigenfunctions with $\gamma\ne\alpha$, provided we choose $\mbox{$<\!\phi_\alpha\,\vert\,\phi_\gamma\!>$}$ = $\delta_{\alpha,\gamma}$. A unique solution for $\partial_\xi \phi_\alpha$ can be obtained by setting $\mbox{$<\!\phi_\gamma\,\vert\,\partial_\xi \phi_\alpha\!>$}$ = 0 for all eigenfunctions $\phi_\gamma$ with $E_\gamma$ = $E_\alpha$. This corresponds to fixing the normalization and phase of the eigenfunctions and, in the case of degenerate eigenvalues, uses the freedom to work with arbitrary orthonormal linear combinations of the corresponding eigenfunctions. Because the operator $(E_\alpha-H)$ is invertible in the space spanned by all eigenfunctions $\phi_\gamma$ with different energy $E_\gamma\ne E_\alpha$, this yields, using orthonormal eigenfunctions,
\begin{displaymath}
\mbox{$\vert\,\partial_\xi \phi_\alpha\!>$}
=
\sum_{\gamma\atop E_\gamma\ne E_\alpha}
\frac{1}{E_\alpha-E_\gamma}
\,
\mbox{$\vert\,\phi_\gamma\!>$}
<\!\phi_\gamma\,\vert\,\partial_\xi v\,\vert\,\phi_\alpha\!>
.
\end{displaymath} (26)

For nondegenerate energies the sum becomes $\sum_{\gamma\ne \alpha}$. The stationarity equation (21) can now be solved iteratively: starting from an initial guess $v^0$ for $v$, one calculates $E_\alpha(v)$ and $\phi_\alpha(v)$, obtains $\partial_\xi E_\alpha$ and $\partial_\xi \phi_\alpha (x_i)$ from Eqs. (24)-(26), and thus $\partial_\xi p(x_i\vert\hat x,v)$ from Eq. (22). Then a new guess for $v$ is calculated (switching to log-likelihoods),
\begin{displaymath}
v^{\rm new} = v^{\rm old}+\eta A^{-1}
\sum_i \partial_\xi \ln p(x_i\vert\hat x,v^{\rm old})
,
\end{displaymath} (27)

with some step width $\eta$ and some positive definite operator $A$ (approximating, for example, the Hessian of $\ln p(x_T\vert\hat x,v)$). This procedure is iterated until convergence.
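As an illustration, the iteration can be written down explicitly for a one-dimensional system discretized on a spatial grid. The following minimal Python sketch computes the likelihood gradient from Eqs. (22), (24) and (26) and performs the update (27) with $A$ chosen as the identity; the grid, the Gaussian parameterization of $v(\xi,x)$, the unit convention $\hbar$ = $m$ = 1, the data points, and the step width are illustrative assumptions and not taken from the text.

\begin{verbatim}
import numpy as np

N, L = 100, 10.0                    # grid points and box length (assumption)
x = np.linspace(0.0, L, N)
dx = x[1] - x[0]
beta = 1.0                          # inverse temperature of the canonical ensemble

# kinetic energy by finite differences, hbar = m = 1, hard walls at the box ends
T = (np.diag(np.full(N, 2.0)) - np.diag(np.ones(N - 1), 1)
     - np.diag(np.ones(N - 1), -1)) / (2.0 * dx**2)

# parameterized potential v(xi, x) = sum_l xi_l g_l(x) with a Gaussian basis g_l
centers = np.linspace(1.0, L - 1.0, 8)
G = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.8)**2)   # G[x, l] = g_l(x)

def spectrum(xi):
    """Eigenvalues and grid-orthonormal eigenvectors of H = T + diag(v), Eq. (18)."""
    return np.linalg.eigh(T + np.diag(G @ xi))

def grad_log_likelihood(xi, data_idx):
    """Gradient of sum_i ln p(x_i|v) using Eqs. (22), (24) and (26)."""
    E, Phi = spectrum(xi)                               # Phi[:, a] = phi_a on the grid
    w = np.exp(-beta * (E - E.min())); w /= w.sum()     # Boltzmann weights
    dE = G.T @ Phi**2                                   # dE[l, a], Eq. (24)
    V = np.einsum('xa,xl,xg->agl', Phi, G, Phi)         # <phi_g | g_l | phi_a>
    denom = E[:, None] - E[None, :]                     # E_a - E_g
    np.fill_diagonal(denom, np.inf)                     # drop g = a (no degeneracies assumed)
    dPhi = np.einsum('xg,agl->xal', Phi, V / denom[:, :, None])   # Eq. (26)
    grad = np.zeros_like(xi)
    for i in data_idx:                                  # data points sitting on the grid
        p_i = w @ Phi[i]**2                             # p(x_i) = <|phi(x_i)|^2>
        dp = (2.0 * np.einsum('a,a,al->l', w, Phi[i], dPhi[i])
              - beta * (np.einsum('a,a,la->l', w, Phi[i]**2, dE)
                        - p_i * (dE @ w)))              # Eq. (22)
        grad += dp / p_i
    return grad

# gradient ascent, Eq. (27), with A = identity and step width eta
xi, eta = np.zeros(len(centers)), 0.05
data_idx = np.random.default_rng(0).integers(N // 4, 3 * N // 4, size=50)
for _ in range(100):
    xi = xi + eta * grad_log_likelihood(xi, data_idx)
\end{verbatim}

The constant grid-normalization factor of the eigenfunctions cancels in the ratio $\partial_\xi p(x_i)/p(x_i)$, so the log-likelihood gradient above does not depend on the grid spacing.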

While a parametric approach restricts the space of possible potentials $v$, a nonparametric approach treats each function value $v(x)$ itself as an individual degree of freedom and does not restrict the space of potentials. The corresponding nonparametric stationarity equation is obtained analogously to the parametric stationarity equation (21) by replacing the partial derivatives $\partial_\xi$ with the functional derivative operator $\delta_v$ = $\delta/\delta v$ with components $\delta_{v(x)}$ = $\delta/\left(\delta v(x)\right)$ [59]. Because the functional derivative of $H$ is simply

\begin{displaymath}
\delta_{v(x)} H (x^\prime,x^{\prime\prime})
= \delta_{v(x)} v(x^\prime)\, \delta (x^\prime-x^{\prime\prime})
= \delta (x-x^\prime)\, \delta (x^\prime-x^{\prime\prime})
,
\end{displaymath} (28)

we get, using the same arguments that led to Eq. (26),
$\displaystyle \delta_{v(x)} E_\alpha$ $\textstyle =$ $\displaystyle \frac{<\!\phi_\alpha\,\vert\,\delta_{v(x)} H\,\vert\,\phi_\alpha\!>}{\mbox{$<\!\phi_\alpha\,\vert\,\phi_\alpha\!>$}} =
\vert\phi_\alpha(x)\vert^2
,$ (29)
$\displaystyle \delta_{v(x)} \phi_\alpha(x^{\prime})$ $\textstyle =$ $\displaystyle \sum_{\gamma\atop E_\gamma\ne E_\alpha} \frac{1}{E_\alpha-E_\gamma}
\,
\phi_\gamma(x^{\prime})
\phi^*_\gamma(x) \phi_\alpha (x)
,$ (30)

and therefore
$\displaystyle \delta_{v(x)}p(x_i\vert\hat x,v)$ $\textstyle =$ $\displaystyle <\left(\delta_{v(x)}\phi^*(x_i)\right) \phi (x_i)>
+<\phi^*(x_i)\delta_{v(x)}\phi (x_i)>$ (31)
  $\textstyle -$ $\displaystyle \beta \left(
< \vert\phi (x_i)\vert^2 \vert\phi (x)\vert^2 >
-< \vert\phi (x_i)\vert^2>< \vert\phi (x)\vert^2>
\right)
.$  

(The partial derivative with respect to parameters $\xi $ and the functional derivative with respect to $v(x)$ are related according to the chain rule $\partial_{\xi_l} p(x_i\vert\hat x,v)$ = $\int\!dx\, (\partial_{\xi_l} v(x))
\, \delta_{v(x)} p(x_i\vert\hat x,v)$ = ${\bf v}_\xi \delta_{v} p(x_i\vert\hat x,v)$ with operator ${\bf v}_\xi (l,x)$ = $\partial_{\xi_l} v(\xi,x)$.)
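In a grid representation, relation (29) can be checked directly: changing the potential at a single grid point changes $E_\alpha$, to first order, by the squared (grid-normalized) eigenfunction at that point. The following short Python sketch compares this with a finite-difference derivative; the grid, the harmonic test potential, and the units $\hbar$ = $m$ = 1 are illustrative assumptions.

\begin{verbatim}
import numpy as np

N, L = 100, 10.0                         # grid points and box length (assumption)
x = np.linspace(0.0, L, N); dx = x[1] - x[0]
T = (np.diag(np.full(N, 2.0)) - np.diag(np.ones(N - 1), 1)
     - np.diag(np.ones(N - 1), -1)) / (2.0 * dx**2)
v = 0.5 * (x - L / 2.0)**2               # harmonic test potential (assumption)

E, Phi = np.linalg.eigh(T + np.diag(v))  # Phi[:, alpha] grid-orthonormal

eps, k, alpha = 1e-6, N // 3, 0          # perturb v at the single grid point x_k
v_pert = v.copy(); v_pert[k] += eps
E_pert = np.linalg.eigvalsh(T + np.diag(v_pert))

num = (E_pert[alpha] - E[alpha]) / eps   # finite-difference derivative
ana = Phi[k, alpha]**2                   # grid version of Eq. (29)
print(num, ana)                          # agree to O(eps)
\end{verbatim}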

The large flexibility of the nonparametric approach allows an optimal adaptation of $v$ to the available training data. However, as is well known in the context of learning, it is this same flexibility that makes a satisfactory generalization to non-training data (e.g., in the future) impossible, leading, for example, to `pathological', $\delta$-functional-like solutions. Nonparametric approaches therefore require additional restrictions in the form of a priori information. In the next section we will include a priori information in the form of stochastic processes, similarly to Bayesian statistics with Gaussian processes [16, 37, 44, 48, 60-63] or to classical regularization theory [2,4,16]. In particular, a priori information will be implemented explicitly, by which we mean that it will be expressed directly in terms of the function values $v(x)$ themselves. This is a great advantage over parametric methods, where a priori information is implicit in the chosen parameterization and thus typically difficult or impossible to analyze and not easily adapted to the situation under study. Indeed, because it is only a priori knowledge that relates training to non-training data, its explicit control is essential for successful learning.

