

Neural networks

While in projection pursuit-like techniques the one-dimensional `ridge' functions $\phi_l$ are adapted optimally, neural networks use ridge functions of a fixed sigmoidal form. The lower flexibility that results from fixing the ridge function is compensated by iterating this parameterization, which leads to multilayer neural networks.

Multilayer neural networks have become a popular tool for regression and classification problems [205,124,159,96,167,231,24,200,10]. One-layer neural networks, also known as perceptrons, correspond to the parameterization

\begin{displaymath}
\phi (z) = \sigma \left(\sum_l w_l z_l-b\right)
=\sigma (v)
,
\end{displaymath} (411)

with a sigmoidal function $\sigma$, parameters $\xi$ = $w$, projection $v = \sum_l w_l z_l-b$, and $z_l$ denoting the single components of the variables $x$, $y$, i.e., $z_{l}$ = $x_l$ for $1\le l \le d_x$ and $z_{l}$ = $y_l$ for $d_x+1\le l\le d_x+d_y$. (For neural networks with Lorentzians instead of sigmoids see [72].)

Typical choices for the sigmoid are $\sigma (v)$ = $\tanh (\beta v)$ or $\sigma (v)$ = $1/(1+e^{-2\beta v})$. The parameter $\beta $, often called the inverse temperature, controls the sharpness of the step of the sigmoid. In particular, the sigmoid functions become a sharp step in the limit $\beta\rightarrow \infty$, i.e., at zero temperature. In principle the sigmoidal function $\sigma$ may depend on further parameters, which then -- similarly to projection pursuit discussed in Section 4.8 -- would also have to be included in the optimization process. The threshold or bias $b$ can be treated as a weight if an additional input component, clamped to the value $1$, is included.
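
As a minimal sketch, assuming NumPy and with freely chosen names (sigma, perceptron, beta), parameterization (411), the two sigmoid choices and the bias-as-weight trick may be written as follows; note that the logistic form equals $(\tanh(\beta v)+1)/2$ and is therefore interchangeable with the tanh choice up to rescaling.

import numpy as np

def sigma(v, beta=1.0, kind="tanh"):
    # sigma(v) = tanh(beta v), or the logistic form 1/(1 + exp(-2 beta v)),
    # which equals (tanh(beta v) + 1)/2; large beta approaches a sharp step
    if kind == "tanh":
        return np.tanh(beta * v)
    return 1.0 / (1.0 + np.exp(-2.0 * beta * v))

def perceptron(z, w, b, beta=1.0):
    # Eq. (411): phi(z) = sigma( sum_l w_l z_l - b ) = sigma(v)
    return sigma(np.dot(w, z) - b, beta)

def perceptron_bias_as_weight(z, w_ext, beta=1.0):
    # bias treated as a weight: append an input component clamped to 1
    # and use the extended weight vector (w_1, ..., w_d, -b)
    return sigma(np.dot(w_ext, np.append(z, 1.0)), beta)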

A linear combination of perceptrons

\begin{displaymath}
\phi (x,y) = b+\sum_l W_l \sigma \left(\sum_k w_{lk} z_k-b_l\right)
,
\end{displaymath} (412)

has the form of a projection pursuit approach (408) but with fixed $\phi_l(v)$ = $W_l \sigma(v)$.
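
A corresponding sketch of the linear combination (412), with illustrative names and array shapes (not part of the original formulation):

import numpy as np

def phi_combination(z, W, w, b_out, b_hidden, beta=1.0):
    # Eq. (412): phi(z) = b + sum_l W_l sigma( sum_k w_{lk} z_k - b_l )
    # shapes: z (m0,), w (m1, m0), b_hidden (m1,), W (m1,), b_out scalar
    v = w @ z - b_hidden                  # projections v_l
    return b_out + W @ np.tanh(beta * v)  # fixed ridge functions W_l sigma(v)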

In multilayer networks the parameterization (411) is cascaded,

\begin{displaymath}
z_{k,i}
= \sigma \left(\sum_{l=1}^{m_{i-1}} w_{kl,i} z_{l,i-1}-b_{k,i}\right)
= \sigma (v_{k,i}),
\end{displaymath} (413)

with $z_{k,i}$ representing the output of the $k$th node (neuron) in layer $i$ and
\begin{displaymath}
v_{k,i} = \sum_{l=1}^{m_{i-1}} w_{kl,i} z_{l,i-1} -b_{k,i}
,
\end{displaymath} (414)

being the input to that node. This yields, skipping the bias terms for simplicity,
\begin{displaymath}
\phi(z,w) =
\sigma\left(\sum_{l_{n-1}=1}^{m_{n-1}} w_{l_{n-1},n}\,
\sigma\left(\sum_{l_{n-2}=1}^{m_{n-2}} w_{l_{n-1}l_{n-2},n-1}
\cdots
\sigma\left(\sum_{l_{0}=1}^{m_{0}} w_{l_{1}l_{0},1}\, z_{l_{0},0}
\right)
\cdots
\right)
\right)
,
\end{displaymath} (415)

beginning with an input layer of $m_0$ = $d_x+d_y$ nodes $z_{l,0}$ = $z_l$ (plus, possibly, nodes to implement the bias), passing through intermediate layers with $m_i$ nodes $z_{l,i}$, $0<i<n$, $1\le l\le m_i$, and ending in a single-node output layer $z_{n}$ = $\phi(x,y)$.
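
The cascade (413)-(415) then amounts to the following forward pass; the layer sizes of the example, the names forward, weights, biases, and the tanh choice are illustrative assumptions, and setting the biases to zero recovers Eq. (415).

import numpy as np

def forward(z0, weights, biases, beta=1.0):
    # Eqs. (413)-(414): z_{k,i} = sigma( sum_l w_{kl,i} z_{l,i-1} - b_{k,i} )
    # weights[i-1]: (m_i, m_{i-1}), biases[i-1]: (m_i,), layers i = 1,...,n
    z = np.asarray(z0, dtype=float)
    for w_i, b_i in zip(weights, biases):
        v = w_i @ z - b_i         # node inputs v_{k,i}
        z = np.tanh(beta * v)     # node outputs z_{k,i}
    return z                      # single-node output layer: phi(x,y)

# example: m0 = 3 inputs, one hidden layer with m1 = 5 nodes, scalar output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))]
biases = [np.zeros(5), np.zeros(1)]
print(forward([0.2, -0.1, 0.7], weights, biases))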

Commonly neural nets are used in regression and classification to parameterize a function $\phi(x,y)$ = $h(x)$ in functionals

\begin{displaymath}
E=\sum_i (y_i-{h}(x_i,w))^2
,
\end{displaymath} (416)

quadratic in ${h}$ and without further regularization terms. In that case, regularization has to be ensured either by 1. using a neural network architecture which is restrictive enough, 2. using training procedures like early stopping, so that the full flexibility of the network structure cannot develop completely and destroy generalization (in both cases the optimal architecture or algorithm can be determined, for example, by cross-validation or bootstrap techniques [166,6,230,216,217,81,39,228,54]), or 3. averaging over ensembles of networks [170]. In all these cases regularization is implicit in the parameterization of the network. Alternatively, explicit regularization or prior terms can be added to the functional. For regression or classification this is done, for example, in learning by hints [2,3,4] or in curvature-driven smoothing with feedforward networks [22].
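
As an illustration of option 2 only, and not of the specific procedures used in the cited references, a generic early-stopping loop might look as follows; train_step, validation_error, and patience are hypothetical placeholders to be supplied by the user.

def early_stopping_fit(train_step, validation_error, w0,
                       max_epochs=1000, patience=20):
    # keep the weights with the lowest hold-out error and stop once it has
    # not improved for `patience` consecutive epochs, so that the full
    # flexibility of the network cannot develop completely
    w, best_w = w0, w0
    best_err, wait = float("inf"), 0
    for epoch in range(max_epochs):
        w = train_step(w)             # one sweep of gradient-based training
        err = validation_error(w)     # error on held-out data
        if err < best_err:
            best_err, best_w, wait = err, w, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_w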

One may also remark that, from a Frequentist point of view, the quadratic functional is not interpreted as a posterior but as a squared-error loss $\sum_i (y_i-a(x_i,w))^2$ for actions $a(x) = a(x,w)$. According to Section 2.2.2, minimization of the error functional (416) for data $\{(x_i,y_i)\vert 1\le i\le n\}$ sampled under the true density $p(x,y\vert f)$ therefore yields an empirical estimate of the regression function $\int \!dy \,y\, p(y\vert x,f)$.
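
The standard argument behind this statement is that, for fixed $x$, the expected squared-error loss is minimized by the conditional mean,

\begin{displaymath}
0 = \frac{\partial}{\partial a(x)} \int\!dy\, p(y\vert x,f)\,\big(y-a(x)\big)^2
= -2 \int\!dy\, p(y\vert x,f)\,\big(y-a(x)\big)
\;\Rightarrow\;
a^*(x) = \int\!dy\, y\, p(y\vert x,f)
,
\end{displaymath}

so that minimizing the sample average (416) yields an empirical estimate of this conditional mean.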

We consider here neural nets as parameterizations for density estimation with prior (and normalization) terms explicitly included in the functional $E_\phi$. In particular, the stationarity equation for functional (352) becomes

\begin{displaymath}
0 = \Phi_w^\prime {\bf P}^\prime {\bf P}^{-1} N
-\Phi_w^\prime {{\bf K}} \phi
-\Phi_w^\prime {\bf P}^\prime \Lambda_X
,
\end{displaymath} (417)

with matrix of derivatives
\begin{displaymath}
\Phi^\prime_{w} (k,l,i; x,y)
= \frac{\partial \phi (x,y,w)}{\partial w_{kl,i}}
= \sigma^\prime (v_n) \sum_{l_{n-1}} w_{l_{n-1},n}\,
\sigma^\prime (v_{l_{n-1},n-1}) \sum_{l_{n-2}} w_{l_{n-1}l_{n-2},n-1}
\cdots
\sum_{l_{i+1}} w_{l_{i+2}l_{i+1},i+2}\,
\sigma^\prime (v_{l_{i+1},i+1})\,
w_{l_{i+1}k,i+1}\,
\sigma^\prime (v_{k,i})\, z_{l,i-1}
,
\end{displaymath} (418)

and $\sigma^\prime (v)$ = $d\sigma(v)/dv$. While $\phi (x,y,w)$ is calculated by forward propagating $z$ = $(x,y)$ through the net defined by the weight vector $w$ according to Eq. (415), the derivatives $\Phi^\prime$ can be calculated efficiently by back-propagation according to Eq. (418). Notice that even for diagonal ${\bf P}^\prime$ the derivatives are needed not only at the data points; the prior and normalization terms require derivatives at all $x$, $y$. Thus, in practice, terms like $\Phi^\prime {\bf K} \phi$ have to be calculated on a relatively coarse discretization. Notice, however, that regularization here is not only due to the prior term but also follows from the restrictions implicit in the chosen neural network architecture. In many practical cases a relatively coarse discretization of the prior term may thus be sufficient.
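
To make the back-propagation of Eq. (418) concrete, a sketch for the bias-free tanh network of Eq. (415) could read as follows; the function and variable names are again only illustrative, and one such pass is needed for every point $(x,y)$ at which $\Phi^\prime$ is required.

import numpy as np

def backprop_derivatives(z0, weights, beta=1.0):
    # forward pass, storing node inputs v_i and outputs z_i (Eqs. 413-414)
    zs, vs = [np.asarray(z0, dtype=float)], []
    for w_i in weights:
        vs.append(w_i @ zs[-1])
        zs.append(np.tanh(beta * vs[-1]))

    def dsigma(v):
        # sigma'(v) = beta (1 - tanh^2(beta v))
        return beta * (1.0 - np.tanh(beta * v) ** 2)

    # backward pass: delta_{k,i} = sigma'(v_{k,i}) sum_j w_{jk,i+1} delta_{j,i+1}
    grads = [None] * len(weights)
    delta = dsigma(vs[-1])                 # sensitivity of the output node
    grads[-1] = np.outer(delta, zs[-2])    # d phi / d w_{l_{n-1},n}
    for i in range(len(weights) - 2, -1, -1):
        delta = dsigma(vs[i]) * (weights[i + 1].T @ delta)
        grads[i] = np.outer(delta, zs[i])  # Eq. (418): delta_{k,i} z_{l,i-1}
    return grads                           # grads[i-1][k, l] = d phi / d w_{kl,i}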

Table 6 summarizes the discussed approaches.


Table 6: Some possible parameterizations.
Ansatz              | Functional form                                                            | To be optimized
linear ansatz       | $\phi(z) = \sum_l \xi_l B_l(z)$                                            | $\xi_l$
linear model        | $\phi(z) = \xi_0 + \sum_l \xi_l z_l$                                       | $\xi_0$, $\xi_l$
  with interaction  | $\qquad\quad + \sum_{mn} \xi_{mn} z_m z_n +\cdots$                         | $\xi_{mn},\cdots$
mixture model       | $\phi(z) = \sum_l \xi_{0,l} B_l(\xi_l,z)$                                  | $\xi_{0,l}$, $\xi_l$
additive model      | $\phi(z) = \sum_l\phi_l(z_l)$                                              | $\phi_l(z_l)$
  with interaction  | $\qquad\quad+ \sum_{mn} \phi_{mn}(z_m z_n) +\cdots$                        | $\phi_{mn}(z_mz_n),\cdots$
product ansatz      | $\phi(z) = \prod_l\phi_l(z_l)$                                             | $\phi_l(z_l)$
decision trees      | $\phi(z) = \sum_l \xi_{l} \prod_k \Theta(z_{\xi_{lk}}-\xi_{0,lk})$         | $\xi_l$, $\xi_{0,lk}$, $\xi_{lk}$
projection pursuit  | $\phi(z)=\xi_0+\sum_l \phi_l(\xi_{0,l}+\sum_k\xi_{lk} z_k)$                | $\phi_l$, $\xi_0$, $\xi_{0,l}$, $\xi_{lk}$
neural net (2 lay.) | $\phi(z)=\sigma\!\left(\sum_{l} \xi_{l}\,\sigma\!\left(\sum_k \xi_{lk} z_k\right)\right)$ | $\xi_l$, $\xi_{lk}$


