Title

Implementation and Evaluation of Maximum Coverage Clustering Algorithms in the Differentially Private Framework

Abstract

We empirically investigate the practicality of differentially private clustering algorithms that use a private Maximum Coverage subroutine. To this end, we present implementations of two differentially private clustering algorithms for the 𝑘-means and 𝑘-medians problems, based on two recent publications by Jones et al. and Nguyen et al. Each implementation is executed on 20 benchmark datasets covering a range of cluster counts, dimensions, and dataset sizes. To investigate the effect of varying privacy settings on the resulting clusters, multiple privacy configurations are used per benchmark dataset. For the evaluation, we compare the experimental results with baseline results obtained by running non-private clustering algorithms on the same benchmark datasets; in particular, the runtime and utility of the algorithms are measured and evaluated against these baselines. We find that the private 𝑘-medians implementation yields long runtimes on datasets with many clusters and many data points. In turn, the utility of the private 𝑘-medians is comparable to that of the non-private baselines for all tested configurations. The evaluation of the private 𝑘-means reveals weaknesses that affect its practicality. The runtime evaluation shows a linear dependence of the runtime on the number of demand points, as well as an exponential increase in runtime beyond certain demand-point thresholds. The utility evaluation of the private 𝑘-means shows adequate results for datasets with many demand points and few clusters, whereas datasets with few demand points or many clusters exhibit an increased loss in utility. Further, the results show that datasets with large diameters significantly decrease the utility of the private 𝑘-means algorithm. The results suggest that this increased loss of utility is caused by the noise added by a differentially private subroutine, the Noisy Average algorithm: given a dataset with few demand points or a large diameter, the results distorted by the Noisy Average subroutine may yield worse utility than results obtained randomly within the bounds of the respective dataset.
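
To illustrate why the Noisy Average subroutine dominates the utility loss for small or wide-spread datasets, the following minimal Python sketch shows a standard Laplace-mechanism noisy average. The function name, parameter names, and the even budget split are illustrative assumptions and are not taken from the implementations by Jones et al. or Nguyen et al.; the sketch only makes the scaling explicit: the noise added to the sum is calibrated to the data diameter, while the denominator grows only with the number of points.

import numpy as np

def noisy_average(points, diameter, epsilon, rng=None):
    """Differentially private average via the Laplace mechanism (illustrative sketch).

    Splits the privacy budget evenly between a noisy count and a noisy sum.
    Assumes every coordinate lies in an interval of length `diameter`, so the
    per-coordinate sensitivity of the sum is bounded by the diameter.
    """
    rng = np.random.default_rng() if rng is None else rng
    points = np.asarray(points, dtype=float)
    n, d = points.shape

    # Half the budget (epsilon / 2) for the count; its sensitivity is 1.
    noisy_count = n + rng.laplace(scale=2.0 / epsilon)
    noisy_count = max(noisy_count, 1.0)  # guard against division by a tiny or negative count

    # Half the budget for the sum; per-coordinate noise is calibrated to the diameter.
    noisy_sum = points.sum(axis=0) + rng.laplace(scale=2.0 * diameter / epsilon, size=d)

    return noisy_sum / noisy_count

# Hypothetical usage: 50 two-dimensional points in a box of diameter 10.
pts = np.random.default_rng(0).uniform(0.0, 10.0, size=(50, 2))
print(noisy_average(pts, diameter=10.0, epsilon=1.0))

In this sketch, doubling the diameter doubles the noise in the sum, while halving the number of points roughly doubles the relative effect of that noise on the average, which mirrors the utility loss reported above for datasets with few demand points or large diameters.
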

Requirements

Clustering methods, especially 𝑘-means and 𝑘-medians

Person working on it

Anton Kosjakov

Category

Master thesis