The hyperparameters considered up to now
have been real numbers, or vectors of real numbers.
Such hyperparameters can describe continuous transformations,
like the translation, rotation or scaling of template functions
and the scaling of inverse covariance operators.
For real $\theta$
and a differentiable posterior,
stationarity conditions can be found by differentiating
the posterior with respect to
$\theta$.
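Spelled out in generic notation (a sketch; we write $p(\theta \mid D)$ for the $\theta$-dependent posterior, which is not notation fixed above):
\[
\frac{\partial}{\partial \theta_k} \log p(\theta \mid D) \;=\; 0
\qquad \text{for all components } \theta_k \text{ of } \theta .
\]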
Instead of a class of continuous transformations,
a finite number of alternative template functions or inverse covariances
may be given.
For example, an image to be reconstructed
might be expected to show a digit between zero and nine,
a letter from some alphabet,
or the face of someone
who is a member of a known group of people.
Similarly,
a particular time series may
be expected to be in either a high or a low variance regime.
In all these cases,
there exists a finite number of classes $j$
which could be represented by specific templates $t_j$
or inverse covariances $K_j$.
Such ``class'' variables
are
nothing but hyperparameters $\theta$
with integer values.
Binary parameters, for example,
allow one to select from two reference functions or two inverse covariances
the one which fits the data best.
E.g., for $\theta \in \{0,1\}$
one can write
\begin{equation}
t(\theta) = \theta\, t_1 + (1-\theta)\, t_0 , \tag{496}
\end{equation}
\begin{equation}
K(\theta) = \theta\, K_1 + (1-\theta)\, K_0 . \tag{497}
\end{equation}
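A minimal numerical sketch of this binary selection (our own illustration, not code from the text; the names `template`, `neg_log_prior`, and the Gaussian prior energy $\frac{1}{2}(h-t)^T K (h-t)$ are assumptions):

```python
import numpy as np

# A binary hyperparameter theta in {0, 1} selects one of two templates
# t0, t1 and one of two inverse covariances K0, K1, as in Eqs. (496), (497).
def template(theta, t0, t1):
    return theta * t1 + (1 - theta) * t0

def inv_covariance(theta, K0, K1):
    return theta * K1 + (1 - theta) * K0

def neg_log_prior(h, theta, t0, t1, K0, K1):
    """Gaussian prior energy 0.5 (h-t)^T K (h-t) for the selected class."""
    d = h - template(theta, t0, t1)
    return 0.5 * d @ inv_covariance(theta, K0, K1) @ d

# Pick the binary theta whose template/covariance fits h best.
t0, t1 = np.zeros(4), np.ones(4)
K0, K1 = 0.5 * np.eye(4), 2.0 * np.eye(4)
h = np.array([0.9, 1.1, 1.0, 0.8])
best = min((0, 1), key=lambda th: neg_log_prior(h, th, t0, t1, K0, K1))
print(best)  # 1: h lies close to t1
```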
For integer $\theta$ the integral over $\theta$
becomes a sum
(we will also use the letter $j$
and write $\theta = j$
for integer hyperparameters),
so that prior, posterior, and predictive density
have the form of a finite mixture
with components $j$.
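Written out for the prior, with mixture weights $p(\theta = j)$ (a generic sketch of this mixture form; conditioning on the data is suppressed):
\[
p(h) \;=\; \sum_j p(\theta = j)\, p(h \mid \theta = j) ,
\]
and analogously for the posterior and the predictive density, each a weighted finite sum over the component densities labeled by $j$.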
For a moderate number of components one may be able to include all of them explicitly. Such prior mixture models will be studied in Section 6.
If the number of mixture components is too large to
include them all explicitly,
one must again restrict oneself to a subset of them.
One possibility is to select a random sample of components
using Monte Carlo methods.
Alternatively, one may search for the $\theta$
with maximal posterior.
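A small sketch of both options (our illustration; `log_posterior` is a hypothetical user-supplied function of the component index $j$):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subset(n_components, n_samples):
    """Select a random sample of component indices (Monte Carlo option)."""
    return rng.choice(n_components, size=n_samples, replace=False)

def best_component(log_posterior, candidates):
    """Search the candidates for the index j with maximal posterior."""
    return max(candidates, key=log_posterior)

# Example with a toy log-posterior over 1000 components.
log_posterior = lambda j: -abs(j - 700) / 100.0
j_best = best_component(log_posterior, random_subset(1000, 50))
```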
In contrast to typical optimization problems for real variables,
the corresponding integer optimization problems
are usually not very smooth with respect to $\theta$
(with smoothness defined in terms of differences instead of derivatives)
and are therefore often much harder to solve.
There exists, however, a variety of deterministic and stochastic integer optimization algorithms, which may be combined with ensemble methods like genetic algorithms [98,79,44,157,121,209,160] and with homotopy methods like simulated annealing [114,156,199,43,1,203,243,68,244,245]. Annealing methods are similar to (Markov chain) Monte Carlo methods, which aim at sampling many points from a specific distribution, for example at a fixed temperature. For Monte Carlo methods it is important to obtain (nearly) independent samples and the correct limiting distribution of the Markov chain. For annealing methods, in contrast, the aim is to find the correct minimum (i.e., the ground state at zero temperature) by smoothly lowering the temperature from a finite value to zero. Here it is less important to model the distributions at nonzero temperatures exactly, but it is important to use an adequate cooling scheme for lowering the temperature.
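A compact sketch of simulated annealing for a vector of binary hyperparameters (our illustration; `energy` stands for a hypothetical negative log-posterior as a function of $\theta$, and the exponential cooling scheme is one common choice among several):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def anneal(energy, n_bits, n_steps=10_000, T0=1.0, T_end=1e-3):
    """Simulated annealing over a binary hyperparameter vector theta."""
    theta = rng.integers(0, 2, size=n_bits)
    E = energy(theta)
    for step in range(n_steps):
        # Exponential cooling scheme from T0 down to T_end.
        T = T0 * (T_end / T0) ** (step / max(n_steps - 1, 1))
        proposal = theta.copy()
        proposal[rng.integers(n_bits)] ^= 1  # flip one randomly chosen bit
        dE = energy(proposal) - E
        # Metropolis acceptance rule at temperature T.
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            theta, E = proposal, E + dE
    return theta, E
```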
Instead of an integer optimization problem
one may also try to solve a similar problem
for real $\theta$.
For example,
the binary $\theta$
in Eqs. (496) and (497)
may be extended to real
$\theta \in [0,1]$.
By smoothly increasing an appropriate additional hyperprior,
one can finally again enforce binary hyperparameters
$\theta \in \{0,1\}$.
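One concrete possibility for such a hyperprior (our example; the text does not fix its form) is a factor
\[
p_\beta(\theta) \;\propto\; \exp\bigl(-\beta\, \theta (1-\theta)\bigr),
\qquad \theta \in [0,1] ,
\]
which is flat for $\beta = 0$ and, since $\theta(1-\theta)$ vanishes only at the endpoints, concentrates on $\theta \in \{0,1\}$ as $\beta$ is increased.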