
Density estimation with Gaussian mixture prior

Having seen Bayesian field theoretical models at work for Gaussian prior factors, we study in this section the slightly more complex prior mixture models. Prior mixture models are an especially useful tool for implementing complex and unsharp prior knowledge. They may be used, for example, to translate verbal statements of experts into quantitative prior densities [132,133,134,135,136], similar to the quantification of ``linguistic variables'' by fuzzy methods [118,119].

We will now study a prior mixture with Gaussian prior components in $L$. Hence, consider the following energy functional with mixture prior

\begin{displaymath}
E_L
= -\ln \sum_j p_j e^{-E_j}
= - (L,N) +(e^L,\Lambda_X)
-\ln \sum_j p_j e^{-\lambda E_{0,j}}
\end{displaymath} (719)

with mixture components
\begin{displaymath}
E_j = -(L,N) + \lambda E_{0,j}
+(e^L,\Lambda_X)
.
\end{displaymath} (720)

We choose Gaussian component prior factors with equal covariances but differing means
\begin{displaymath}
E_{0,j} = \frac{1}{2} \Big(L-t_j,\,{{\bf K}} (L-t_j)\Big)
.
\end{displaymath} (721)

Hence, the stationarity equation for Functional (719) becomes
\begin{displaymath}
0 = N - \lambda {\bf K} \left(L - \sum_j a_j t_j \right)
- {\bf e}^{\bf L} \Lambda_X
,
\end{displaymath} (722)

with Lagrange multiplier function
\begin{displaymath}
\Lambda_X
= N_X - \lambda{\bf I}_X{\bf K} \left(L - \sum_j a_j t_j \right)
,
\end{displaymath} (723)

and mixture coefficients
\begin{displaymath}
a_j = \frac{p_j e^{-\lambda E_{0,j}}}{\sum_k p_k e^{-\lambda E_{0,k}}}
.
\end{displaymath} (724)
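
To make the discretized equations concrete, the following sketch (in Python/NumPy; not the code used for the figures, with mesh, operator ${\bf K}$ and all variable names being illustrative assumptions) evaluates the mixture coefficients (724), the Lagrange multiplier function (723) and the stationarity residual (722) for a likelihood $L$ given on an $l_x\times l_y$ mesh:

\begin{verbatim}
# Minimal Python/NumPy sketch (not the code used for the figures) of the
# discretized mixture-prior quantities; mesh, operator K and all names
# are illustrative assumptions, and mesh-spacing factors are absorbed
# into K for simplicity.
import numpy as np

lam = 0.5                          # mixture parameter lambda
p   = np.array([0.5, 0.5])         # prior mixture weights p_j
dy  = 1.0                          # assumed y-mesh spacing

def E0(L, t, K):
    """Component prior energy E_{0,j} = (1/2)(L-t, K(L-t)), Eq. (721)."""
    d = (L - t).ravel()
    return 0.5 * d @ (K @ d)

def mixture_coefficients(L, templates, K):
    """Mixture coefficients a_j of Eq. (724)."""
    logw = np.log(p) - lam * np.array([E0(L, t, K) for t in templates])
    logw -= logw.max()                     # avoid overflow
    w = np.exp(logw)
    return w / w.sum()

def stationarity_residual(L, templates, N, K):
    """R.h.s. of Eq. (722); a solution L makes this residual zero.
    L and N are (lx, ly) arrays, K an (lx*ly, lx*ly) matrix."""
    a = mixture_coefficients(L, templates, K)
    tbar = sum(aj * tj for aj, tj in zip(a, templates))   # sum_j a_j t_j
    KLt = (K @ (L - tbar).ravel()).reshape(L.shape)
    NX = N.sum(axis=1)                     # N_X(x): data points per x
    LambdaX = NX - lam * dy * KLt.sum(axis=1)             # Eq. (723)
    return N - lam * KLt - np.exp(L) * LambdaX[:, None]   # Eq. (722)
\end{verbatim}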

The parameter $\lambda $ plays a role here similar to that of the inverse temperature $\beta $ for prior mixtures in regression (see Sect. 6.3). In contrast to the $\beta $-parameter in regression, however, the ``low temperature'' solutions for $\lambda\rightarrow \infty$ are the pure prior templates $t_j$, while for $\lambda\rightarrow 0$ the prior factor is switched off.
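
As a small illustration of this temperature-like behaviour, one may evaluate the coefficients (724) for hypothetical component energies at different values of $\lambda$ (the numbers below are chosen purely for illustration):

\begin{verbatim}
# Illustration of the temperature-like role of lambda for the mixture
# coefficients a_j of Eq. (724); the component energies are hypothetical.
import numpy as np

p = np.array([0.5, 0.5])              # assumed equal mixture weights p_j
E0_vals = np.array([1.3, 2.1])        # hypothetical E_{0,1}, E_{0,2}
for lam in (0.01, 1.0, 100.0):
    w = p * np.exp(-lam * E0_vals)
    print(lam, w / w.sum())
# lambda -> 0  : a_j -> p_j            (prior factor switched off)
# lambda large : a -> (1, 0)           (solution driven to template t_1)
\end{verbatim}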

Typical numerical results of a prior mixture model with two mixture components are presented in Figs. 23-28. As for Fig. 21, the true likelihood used for these calculations is given by Eq. (714) and shown in Fig. 18. The corresponding true regression function is thus that of Fig. 19. Also, the same training data have been used as for the model of Fig. 21 (Fig. 20). The two templates $t_1$ and $t_2$ selected for the two prior mixture components are (Fig. 18)

\begin{displaymath}
t_1(x,y) = \frac{1}{2\sqrt{2\pi}\sigma_t}
\left(
e^{-\frac{(y-\mu_a)^2}{2\sigma_t^2}}
+
e^{-\frac{(y-\mu_b)^2}{2\sigma_t^2}}
\right)
,
\end{displaymath} (725)

\begin{displaymath}
t_2(x,y) = \frac{1}{\sqrt{2\pi}\sigma_t}e^{-\frac{(y-\mu_2)^2}{2\sigma_t^2}}
,
\end{displaymath} (726)

with $\sigma_t$ = $2$, $\mu_a$ = $\mu_2+25/9$ = $10.27$, $\mu_b$ = $\mu_2-25/9$ = $4.72$, and $\mu_2$ = $15/2$. Both templates capture a bit of the structure of the true likelihood, but not too much, so learning remains interesting. The average test error of $t_1$ is 2.56 and thus lower than that of $t_2$, which is 2.90. The minimal possible average test error, 2.23, is that of the true solution $P_{\rm true}$. A uniform $P$, being the effective template in the zero-mean case of Fig. 21, has an average test error of 2.68, lying between those of the two templates $t_1$ and $t_2$.
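
For reference, a short sketch constructing the two templates of Eqs. (725) and (726) on the numerical mesh (the $y$-grid and its spacing are assumptions; only $\sigma_t$, $\mu_a$, $\mu_b$ and $\mu_2$ are taken from the text):

\begin{verbatim}
# Sketch of the two templates of Eqs. (725) and (726) on the numerical
# mesh; the y-grid (1,...,15) and its spacing are assumptions, only
# sigma_t, mu_a, mu_b and mu_2 are taken from the text.
import numpy as np

lx, ly = 10, 15                          # mesh sizes as in the captions
sigma_t, mu_2 = 2.0, 7.5
mu_a, mu_b = mu_2 + 25/9, mu_2 - 25/9    # approx. 10.27 and 4.72
y = np.arange(1, ly + 1, dtype=float)    # assumed y-grid

def gauss(y, mu, s):
    return np.exp(-(y - mu)**2 / (2*s**2)) / (np.sqrt(2*np.pi) * s)

t1_y = 0.5 * (gauss(y, mu_a, sigma_t) + gauss(y, mu_b, sigma_t))  # (725)
t2_y = gauss(y, mu_2, sigma_t)                                    # (726)

# The templates do not depend on x; on the mesh they are repeated along
# the x-direction.  The templates for L are t_i^L = ln t_i^P.
t1P, t2P = np.tile(t1_y, (lx, 1)), np.tile(t2_y, (lx, 1))
t1L, t2L = np.log(t1P), np.log(t2P)
\end{verbatim}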

Fig. 23 shows that convergence is fast for massive prior relaxation when starting from $t_1$ as initial guess $L^{(0)}$. Compared to Fig. 21 the solution is a bit smoother, and since template $t_1$ is a better reference than the uniform likelihood, the final test error is slightly lower than for the zero-mean Gaussian prior on $L$. Starting from $L^{(0)}$ = $t_2$, convergence is not much slower and the final solution is similar, the test error in that particular case being even lower (Fig. 24). Starting from a uniform $L^{(0)}$, the mixture model produces a solution very similar to that of Fig. 21 (Fig. 25).
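
A minimal sketch of one such massive-prior-relaxation step, reusing the stationarity residual of the sketch above, is given below. The simple line search over a few trial step widths (on the residual norm) is an assumption for brevity; the calculations reported here optimize the step width $\eta$ on the energy $E_L$ itself.

\begin{verbatim}
# Hedged sketch of one massive-prior-relaxation step, reusing the
# stationarity_residual function of the sketch above.  The crude line
# search over a few trial step widths (on the residual norm) is an
# assumption; the reported calculations optimize eta on E_L itself.
import numpy as np

def relaxation_step(L, templates, N, K, m2=0.01,
                    etas=(0.1, 0.3, 1.0, 3.0)):
    """One step L -> L + eta * (K + m^2 I)^{-1} G, with G the residual
    of Eq. (722)."""
    A = K + m2 * np.eye(K.shape[0])
    G = stationarity_residual(L, templates, N, K)
    step = np.linalg.solve(A, G.ravel()).reshape(L.shape)
    def resnorm(eta):
        return np.linalg.norm(
            stationarity_residual(L + eta * step, templates, N, K))
    return L + min(etas, key=resnorm) * step
\end{verbatim}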

The effect of changing the parameter $\lambda $ of the prior mixture can be seen in Figs. 26 and 27. A larger $\lambda $ yields a smoother solution and faster convergence when starting from a template likelihood (Fig. 26), while a smaller $\lambda $ results in a more rugged solution and slower convergence. The test error in Fig. 27 already indicates overfitting.

Prior mixture models tend to produce metastable and approximately stable solutions. Fig. 28 presents an example where, starting with $L^{(0)}$ = $t_2$, the learning algorithm seems to have produced a stable solution after a few iterations. When iterated long enough, however, this solution decays into one with smaller distance to $t_1$ and with lower test error. Notice that this effect can be prevented by starting from another initialization, for example $L^{(0)}$ = $t_1$ or a similar initial guess.
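
Such metastable behaviour is easy to monitor numerically. The following sketch (building on the functions introduced in the sketches above, with all specifics being assumptions) records the mixture coefficients $a_j$ and the distances to the templates during iteration, which makes a plateau and its eventual decay visible:

\begin{verbatim}
# Illustrative monitoring of metastability, building on the functions
# introduced in the sketches above (all specifics are assumptions):
# recording the mixture coefficients a_j and the distances |L - t_j|
# per iteration makes a plateau and its eventual decay visible.
import numpy as np

def track_metastability(L0, templates, N, K, m2=0.5, iters=20):
    L, history = L0.copy(), []
    for i in range(iters):
        L = relaxation_step(L, templates, N, K, m2=m2)
        a = mixture_coefficients(L, templates, K)
        dists = [float(np.linalg.norm(L - t)) for t in templates]
        history.append((i, list(a), dists))
    return L, history
\end{verbatim}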

We have now seen that, and also how, learning algorithms for Bayesian field theoretical models can be implemented. In this paper, the discussion of numerical aspects was focused on general density estimation problems. Other Bayesian field theoretical models, e.g., for regression and inverse quantum problems, have also proven to be numerically feasible. Specifically, prior mixture models for Gaussian regression are compared with so-called Landau-Ginzburg models in [132]. An application of prior mixture models to image completion, formulated as a Gaussian regression model, can be found in [137]. Furthermore, hyperparameters have been included in numerical calculations in [133] and also in [137]. Finally, learning algorithms for inverse quantum problems are treated in [143] for inverse quantum statistics and, in combination with a mean field approach, in [142] for inverse quantum many-body theory. Time-dependent inverse quantum problems will be the topic of [138].

In conclusion, many different Bayesian field theoretical models have already been studied numerically and proven to be computationally feasible. This also shows that such nonparametric Bayesian approaches are relatively easy to adapt to a variety of quite different learning scenarios. Applications of Bayesian field theory requiring further study include, for example, the prediction of time series and the interactive implementation of unsharp a priori information.

Figure 14: Density estimation with 2 data points and a Gaussian prior factor for the log-probability $L$. First row: Final $P$ and $L$. Second row: The l.h.s. shows the energy $E_L$ (108) during iteration, the r.h.s. the regression function $h(x)$ = $\int \!dy \, y p(y\vert x,h_{\rm true})$ = $\int \!dy \, y P_{\rm true}(x,y)$. The dotted lines indicate the range of one standard deviation above and below the regression function (ignoring periodicity in $x$). The fast convergence shows that the problem is nearly linear. The asymmetry of the solution between the $x$- and $y$-direction is due to the normalization constraint, only required for $y$. (Laplacian smoothness prior ${\bf K}$ as given in Eq. (705) with $\lambda _x$ = $\lambda _y$ = 1, $\lambda _0$ = 0, $\lambda _2$ = 0.025, $\lambda _4$ = $\lambda _6$ = 0. Iteration with negative Hessian ${\bf A}$ = $-{\bf H}$ if positive definite, otherwise with the gradient algorithm, i.e., ${\bf A}$ = ${\bf I}$. Initialization with $L^{(0)}$ = $\ln (\tilde {\bf C} \tilde P_{\rm emp})$, i.e., $L^{(0)}$ normalized to $\int \!dy\,e^L$ = $1$, with $\tilde {\bf C}$ of Eq. (698) and ${\bf C}$ = $({\bf K}+m_C^2{\bf I})$, $m_C^2$ = $0.1$. Within each iteration step the optimal step width $\eta $ has been found by a line search. Mesh with $10$ points in $x$-direction and $15$ points in $y$-direction, periodic boundary conditions in $x$ and $y$. The $2$ data points are $(3,3)$ and $(7,12)$.)
\begin{figure}\vspace{-4cm}
\begin{center}
\epsfig{file=ps/dens2LaHess.ps, width=132mm}\end{center}\vspace{-4.1cm}
\end{figure}

Figure 15: Density estimation with 2 data points, this time with a Gaussian prior factor for the probability $P$, minimizing the energy functional $E_P$ (164). To make the figure comparable with Fig. 14, the parameters have been chosen so that the maximum of the solution $P$ is the same in both figures ($\max P$ = 0.6). Notice that, compared to Fig. 14, the smoothness prior is less effective for small probabilities. (Same data, mesh and periodic boundary conditions as for Fig. 14. Laplacian smoothness prior ${\bf K}$ as in Eq. (705) with $\lambda _x$ = $\lambda _y$ = 1, $\lambda _0$ = 0, $\lambda _2$ = 1, $\lambda _4$ = $\lambda _6$ = 0. Iterated using massive prior relaxation, i.e., ${\bf A}$ = ${\bf K}+m^2{\bf I}$ with $m^2$ = 1.0. Initialization with $P^{(0)}$ = $\tilde {\bf C}\tilde P_{\rm emp}$, with $\tilde {\bf C}$ of Eq. (698) so that $P^{(0)}$ is correctly normalized, and ${\bf C}$ = $({\bf K}+m_C^2{\bf I})$, $m_C^2$ = $1.0$. Within each iteration step the optimal factor $\eta $ has been found by a line search algorithm.)
\begin{figure}\vspace{-2cm}
\begin{center}
\epsfig{file=ps/dens2Paaa.ps, width=132mm}\end{center}\vspace{-4.1cm}
\end{figure}

Figure 16: Density estimation with a $(-\Delta ^3)$ Gaussian prior factor for the log-probability $L$. Such a prior favors probabilities of Gaussian shape. (Smoothness prior ${\bf K}$ of the form of Eq. (705) with $\lambda _x$ = $\lambda _y$ = 1, $\lambda _0$ = 0, $\lambda _2$ = 0, $\lambda _4$ = 0, $\lambda _6$ = 0.01. Same iteration procedure, initialization, data, mesh and periodic boundary conditions as for Fig. 14.)
\begin{figure}\vspace{-2cm}
\begin{center}
\epsfig{file=ps/dens2LgaussA.ps, width=132mm}\end{center}\vspace{-4cm}
\end{figure}

Figure 17: Density estimation with a $(-\Delta ^3)$ Gaussian prior factor for the probability $P$. As the variation of $P$ is smaller than that of $L$, a larger $\lambda _6$ has been chosen than in Fig. 16. The Gaussian prior in $P$ is also relatively less effective for small probabilities than a comparable Gaussian prior in $L$. (Smoothness prior ${\bf K}$ of the form of Eq. (705) with $\lambda _x$ = $\lambda _y$ = 1, $\lambda _0$ = 0, $\lambda _2$ = 0, $\lambda _4$ = 0, $\lambda _6$ = 0.1. Same iteration procedure, initialization, data, mesh and periodic boundary conditions as for Fig. 15.)
\begin{figure}\vspace{-2cm}
\begin{center}
\epsfig{file=ps/dens2PgaussA.ps, width=132mm}\end{center}\vspace{-4cm}
\end{figure}

Figure 18: First row: True density $P_{\rm true}$ (l.h.s.) and true log-density $L_{\rm true}$ = $\ln P_{\rm true}$ (r.h.s.) used for Figs. 21-28. Second and third row: The two templates $t_1$ and $t_2$ of Figs. 23-28 for $P$ ($t_i^P$, l.h.s.) and for $L$ ($t_i^L$, r.h.s.), respectively, with $t_i^L$ = $\ln t_i^P$. As reference for the following figures we give the expected test error $-\int \!dy\,dx\, p(x)p(y\vert x,h_{\rm true}) \ln p(y\vert x,h)$ under the true $p(y\vert x,h_{\rm true})$ for uniform $p(x)$. It is equal to 2.23 for $h_{\rm true}$, 2.56 for template $t_1$, 2.90 for template $t_2$, and 2.68 for a uniform $P$.
\begin{figure}\vspace{-2cm}
\begin{center}
\epsfig{file=ps/mixTrue-P.eps, width=...
...s, width= 65mm}\epsfig{file=ps/mixT2-L.eps, width= 65mm}\end{center}\end{figure}

Figure 19: Regression function $h_{\rm true}(x)$ for the true density $P_{\rm true}$ of Fig. 18, defined as $h(x)$ = $\int \!dy \, y p(y\vert x,h_{\rm true})$ = $\int \!dy \, y P_{\rm true}(x,y)$. The dashed lines indicate the range of one standard deviation above and below the regression function.
\begin{figure}\begin{center}
\epsfig{file=ps/mixTrue-R.eps, width= 60mm}\end{center}\end{figure}

Figure 20: L.h.s.: Empirical density $N(x,y)/n$ = $\sum _i\delta (x-x_i)\delta (y-y_i)/\sum _i 1$, sampled from $p(x,y\vert h_{\rm true})$ = $p(y\vert x,h_{\rm true}) p(x)$ with uniform $p(x)$. R.h.s.: Corresponding conditional empirical density $P_{\rm emp}(x,y)$ = $({\bf N}_X^{-1} N)(x,y)$ = $\sum _i\delta (x-x_i)\delta (y-y_i)/\sum _i \delta (x-x_i)$. Both densities are obtained from the 50 data points used for Figs. 21-28.
\begin{figure}\begin{center}
\epsfig{file=ps/mix50Data-P.eps, width= 65mm}\epsfig{file=ps/mix50Data-CP.eps, width= 65mm}\end{center}\end{figure}

Figure 21: Density estimation with Gaussian prior factor for log-probability $L$ with 50 data points shown in Fig. 20. Top row: Final solution $P(x,y)$ = $p(y\vert x,h)$ and $L=\ln P$. Second row: Energy $E_L$ (108) during iteration and final regression function. Bottom row: Average training error $-(1/n)\sum_{i=1}^n\ln p(y_i\vert x_i,h)$ during iteration and average test error $-\int \!dy\,dx\, p(x)p(y\vert x,h_{\rm true}) \ln p(y\vert x,h)$ for uniform $p(x)$. (Parameters: Zero mean Gaussian smoothness prior with inverse covariance $\lambda {\bf K}$, $\lambda $ = 0.5 and ${\bf K}$ of the form (705) with $\lambda _x$ = 2, $\lambda _y$ = 1, $\lambda _0$ = 0, $\lambda _2$ = 1, $\lambda _4$ = $\lambda _6$ = 0, massive prior iteration with ${\bf A}$ = ${\bf K}+m^2{\bf I}$ and squared mass $m^2$ = $0.01$. Initialized with normalized constant $L$. At each iteration step the factor $\eta $ has been adapted by a line search algorithm. Mesh with $10$ points in $x$-direction and $15$ points in $y$-direction, periodic boundary conditions in $y$.)
\begin{figure}\begin{center}
\epsfig{file=ps/mix50jP.eps, width= 65mm}\epsfig{fi...
...sfig{file=ps/mix50jErr.eps, width= 50mm}\end{center}\vspace{-0.2cm}
\end{figure}

Figure 22: Comparison of iteration schemes and initialization. First row: Massive prior iteration (with ${\bf A}$ = ${\bf K}+m^2{\bf I}$, $m^2$ = $0.01$) and uniform initialization. Second row: Hessian iteration (${\bf A}$ = $-{\bf H}$) and uniform initialization. Third row: Hessian iteration and kernel initialization (with ${\bf C}$ = ${\bf K}+m_C^2 {\bf I}$, $m_C^2$ = $0.01$ and normalized afterwards). Fourth row: Gradient (${\bf A}$ = ${\bf I}$) with uniform initialization. Fifth row: Gradient with kernel initialization. Sixth row: Gradient with delta-peak initialization. (Initial $L$ equal to $\ln(N/n+\epsilon)$, $\epsilon $ = $10^{-10}$, conditionally normalized. For $N/n$ see Fig. 20). Minimal number of iterations 4, maximal number of iterations 50, iteration stopped if $\vert L^{(i)}-L^{(i-1)}\vert<10^{-8}$. Energy functional and parameters as for Fig. 21.
\begin{figure}\vspace{-1.4cm}
\begin{center}
\epsfig{file=ps/mix50jEne.eps, widt...
...sfig{file=ps/mix50nErr.eps, width= 39mm}\end{center}\vspace{-0.5cm}
\end{figure}

Figure 23: Density estimation with a Gaussian mixture prior for log-probability $L$ with 50 data points, Laplacian prior and the two template functions shown in Fig. 18. Top row: Final solution $P(x,y)$ = $p(y\vert x,h)$ and $L=\ln P$. Second row: Energy $E_L$ (719) during iteration and final regression function. Bottom row: Average training error $-(1/n)\sum_{i=1}^n\ln p(y_i\vert x_i,h)$ during iteration and average test error $-\int \!dy\,dx\, p(x)p(y\vert x,h_{\rm true}) \ln p(y\vert x,h)$ for uniform $p(x)$. (Two mixture components with $\lambda $ = 0.5 and smoothness prior with ${\bf K}_1$ = ${\bf K}_2$ of the form (705) with $\lambda _x$ = 2, $\lambda _y$ = 1, $\lambda _0$ = 0, $\lambda _2$ = 1, $\lambda _4$ = $\lambda _6$ = 0, massive prior iteration with ${\bf A}$ = ${\bf K}+m^2{\bf I}$ and squared mass $m^2$ = $0.01$, initialized with $L$ = $t_1$. At each iteration step the factor $\eta $ has been adapted by a line search algorithm. Mesh with $l_x$ = $10$ points in $x$-direction and $l_y$ = $15$ points in $y$-direction, $n$ = $50$ data points as shown in Fig. 20, periodic boundary conditions in $y$. Except for the inclusion of two mixture components, parameters are equal to those for Fig. 21.)
\begin{figure}\vspace{-1.cm}
\begin{center}
\epsfig{file=ps/mix50aP.eps, width= ...
...sfig{file=ps/mix50aErr.eps, width= 50mm}\end{center}\vspace{-0.5cm}
\end{figure}

Figure 24: Using a different starting point. (Same parameters as for Fig. 23, but initialized with $L$ = $t_2$.) While the initial guess is worse than that of Fig. 23, the final solution is even slightly better.
\begin{figure}\begin{center}
\epsfig{file=ps/mix50bP.eps, width= 65mm}\epsfig{fi...
... width= 50mm}\epsfig{file=ps/mix50bErr.eps, width= 50mm}\end{center}\end{figure}

Figure 25: Starting from a uniform initial guess. (Same as Fig. 23, but initialized with uniform $L$.) The resulting solution is, compared to Figs. 23 and 24, a bit more wiggly, i.e., more data oriented. One recognizes a slight ``overfitting'', meaning that the test error increases while the training error is decreasing. (Despite the increase of the test error during iteration at this value of $\lambda $, a better solution cannot necessarily be found by just changing the value of $\lambda $. This situation can occur, for example, if the initial guess is better than the implemented prior.)
\begin{figure}\begin{center}
\epsfig{file=ps/mix50cP.eps, width= 65mm}\epsfig{fi...
... width= 50mm}\epsfig{file=ps/mix50cErr.eps, width= 50mm}\end{center}\end{figure}

Figure 26: Large $\lambda $. (Same parameters as for Fig. 23, except for $\lambda $ = 1.0.) Due to the larger smoothness constraint the average training error is larger than in Fig. 23. The fact that the test error is also larger than in Fig. 23 indicates that this value of $\lambda $ is too large. Convergence, however, is very fast.
\begin{figure}\begin{center}
\epsfig{file=ps/mix50dP.eps, width= 65mm}\epsfig{fi...
... width= 50mm}\epsfig{file=ps/mix50dErr.eps, width= 50mm}\end{center}\end{figure}

Figure 27: Overfitting due to a too small $\lambda $. (Same parameters as for Fig. 23, except for $\lambda $ = 0.1.) A small $\lambda $ allows the average training error to become quite small. However, the average test error grows already after two iterations. (If the test error is found to increase during iteration at some value of $\lambda $, it is often, but not necessarily, the case that a better solution can be found by changing $\lambda $.)
\begin{figure}\begin{center}
\epsfig{file=ps/mix50eP.eps, width= 65mm}\epsfig{fi...
... width= 50mm}\epsfig{file=ps/mix50eErr.eps, width= 50mm}\end{center}\end{figure}

Figure 28: Example of an approximately stable solution. (Same parameters as for Fig. 23, except for $\lambda $ = $1.2$, $m^2$ = $0.5$, and initialized with $L$ = $t_2$.) A nearly stable solution is obtained after two iterations, followed by a plateau between iterations 2 and 6. A better solution, with smaller distance to template $t_1$, is finally found. (The plateau gets elongated with growing mass $m$.) The figure on the l.h.s. of the bottom row shows the mixing coefficients $a_j$ of the components of the prior mixture model during iteration ($a_1$: solid line, $a_2$: dashed line).
\begin{figure}\begin{center}
\epsfig{file=ps/mix50fP.eps, width= 65mm}\epsfig{fi...
... width= 50mm}\epsfig{file=ps/mix50fErr.eps, width= 50mm}\end{center}\end{figure}




Acknowledgements The author wants to thank Federico Girosi, Tomaso Poggio, Jörg Uhlig, and Achim Weiguny for discussions.

