Distance-preserving vector space embedding for generalized median based consensus learning

Andreas Nienkötter, Xiaoyi Jiang

Learning a consensus object from a set of given objects is a core problem in machine learning and pattern recognition. One example is text recognition, where different algorithms or parameter settings produce different recognized texts. Consensus learning combines them into a single text which ideally contains fewer errors than any individual result.

One way to compute such a consensus is generalized median calculation. The generalized median of a set of objects is a new object with the smallest sum of distances to all objects in the set. Computing the generalized median exactly is often NP-hard, for example for strings under the string edit distance. Therefore, approximate solutions are needed. Recently, prototype embedding methods have been used successfully to calculate the median. Prototype embedding consists of three steps (see Figure 1):

  1. Embed the objects into a vector space such that the distances between the vectors approximate the distances between the objects
  2. Calculate the median in vector space
  3. Reconstruct the generalized median object from its vector representation
Figure 1: Overview of prototype embedding (© WWU)
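
Formally, given a set O = {o_1, ..., o_n} of objects from a universe U with distance function d, the generalized median is the object (not necessarily contained in the set itself) that minimizes the sum of distances to the set:

  \bar{o} = \arg\min_{o \in U} \sum_{i=1}^{n} d(o, o_i)

In contrast, the set median is restricted to the given objects; allowing the median to be any object of the universe is what makes exact computation hard in many domains.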

All of these steps can be carried out easily in vector space. For the first step, prototype embedding assigns to each object a vector of its distances to a number of selected prototype objects. While this is very fast, the distances between the resulting vectors often differ considerably from the distances between the objects, leading to inaccurate median computations. In "Distance-preserving vector space embedding for generalized median based consensus learning", we use distance-preserving embedding methods instead. These methods compute embeddings that explicitly aim to preserve the object distances in vector space, resulting in better median approximations.
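
As a rough illustration (not the toolbox API), the three-step pipeline applied to a precomputed n-by-n distance matrix D could look like the following MATLAB sketch; the function name, the fixed iteration count, the choice of prototypes, and the simplified reconstruction step are assumptions made only for this sketch:

  % Minimal sketch of the three-step pipeline (not the toolbox API).
  % D: n-by-n matrix of pairwise object distances, prototypeIdx: prototype indices.
  % Requires MATLAB R2016b or newer (implicit expansion).
  function idx = prototype_median_sketch(D, prototypeIdx)
      % Step 1: prototype embedding -- row i holds the distances of object i
      % to the selected prototype objects.
      X = D(:, prototypeIdx);

      % Step 2: median in vector space -- geometric median via Weiszfeld updates.
      v = mean(X, 1);                                    % initial guess
      for it = 1:100
          w = 1 ./ max(sqrt(sum((X - v).^2, 2)), eps);   % inverse distances to v
          v = sum(w .* X, 1) / sum(w);                   % weighted average update
      end

      % Step 3: reconstructing an object from the median vector is domain specific
      % (it needs a weighted mean of objects); as a placeholder we simply return
      % the index of the object whose embedding is closest to the median vector.
      [~, idx] = min(sqrt(sum((X - v).^2, 2)));
  end

The distortion mentioned above is exactly the gap between D(i,j) and the Euclidean distance between rows i and j of X; distance-preserving embedding methods reduce this gap and thereby yield better median approximations.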

Matlab Toolbox

Here you can download a toolbox based on this work that computes the generalized median for any set of objects. It only requires a distance function and a weighted mean function between two objects; the latter can often be derived from the distance function. Several embedding methods are included, for example:

  • FastMap
  • SparseMap
  • MetricMap
  • Multidimensional Scaling (MDS)
  • Sammon Mapping
  • Curvilinear Component Analysis (CCA)
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
  • Maximum Variance Unfolding (MVU)
  • Locally Linear Embedding (LLE)
  • IsoMap
  • Prototype Embedding

Included are examples using strings and the string edit distance (Levenshtein distance).
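
For orientation, a string edit distance of the kind used in these examples can be implemented with the classic dynamic-programming recurrence. The following MATLAB sketch is illustrative only; the toolbox ships its own string functions, and the name string_edit_distance is not part of its interface:

  % Illustrative sketch only -- the toolbox provides its own string functions.
  % Classic dynamic-programming Levenshtein distance between two char arrays.
  function d = string_edit_distance(a, b)
      n = numel(a); m = numel(b);
      D = zeros(n + 1, m + 1);
      D(:, 1) = (0:n)';          % cost of deleting the first i characters of a
      D(1, :) = 0:m;             % cost of inserting the first j characters of b
      for i = 1:n
          for j = 1:m
              subst = double(a(i) ~= b(j));              % 0 if the characters match
              D(i + 1, j + 1) = min([D(i, j + 1) + 1, ...  % deletion
                                     D(i + 1, j) + 1, ...  % insertion
                                     D(i, j) + subst]);    % substitution / match
          end
      end
      d = D(n + 1, m + 1);
  end

A matching weighted mean of two strings is commonly derived from the same dynamic program by applying a fraction of the edit operations along an optimal edit path, depending on the weight.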

Notes

  • 01.12.2016: Toolbox updated to version 1.1.