🤖 AI Summary
To address the challenge of clustering distributed, privacy-sensitive data across edge devices without centralized aggregation, this paper proposes a novel k-means clustering framework integrating differential privacy (DP) and federated learning. The core method introduces a small set of non-distributed, non-sensitive auxiliary data at the server side to initialize the DP-constrained federated clustering process—marking the first such use in DP-Lloyd variants—and thereby mitigates initialization bias and slow convergence caused by DP noise. Theoretically, we establish convergence guarantees and derive an upper bound on cluster identification success probability. Empirically, our approach significantly outperforms existing federated private clustering methods on both synthetic and real-world benchmark datasets, achieving faster convergence, higher clustering quality, and rigorous ε-differential privacy.
📝 Abstract
Clustering is a cornerstone of data analysis, particularly suited to identifying coherent subgroups or substructures in unlabeled data, which is nowadays generated continuously and in large amounts. However, traditional clustering methods are often not applicable, because data are increasingly produced and stored in a distributed way, e.g. on edge devices, and privacy concerns prevent them from being transferred to a central server. To address this challenge, we present acronym, a new algorithm for $k$-means clustering that is fully federated as well as differentially private. Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization. Combining our initialization with a simple federated DP-Lloyd algorithm yields a method that achieves excellent results on synthetic and real-world benchmark tasks. We also provide a theoretical analysis that bounds the convergence speed and the probability of successful cluster identification.
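To make the DP-Lloyd idea mentioned above concrete, the sketch below shows a single Lloyd iteration in which the per-cluster sums and counts are perturbed with Laplace noise before the centroid update. This is an illustrative sketch of the generic DP-Lloyd step, not the paper's exact federated protocol or its server-side initialization; the clipping bound of $[-1, 1]$ per coordinate and the even split of the per-iteration budget $\varepsilon$ between sums and counts are assumptions made here for the example.

```python
import numpy as np

def dp_lloyd_step(X, centroids, epsilon, rng):
    """One Lloyd iteration with Laplace noise on per-cluster sums and counts.

    Illustrative DP-Lloyd sketch, not the paper's exact protocol. Assumes the
    coordinates of X are clipped to [-1, 1], so the L1 sensitivity of a
    cluster sum is at most d (the dimension) and that of a count is 1.
    """
    k, d = centroids.shape
    # Assignment step: each point goes to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    # Split the per-iteration privacy budget between sums and counts
    # (an assumed 50/50 split for this example).
    eps_sum, eps_count = epsilon / 2, epsilon / 2
    new_centroids = centroids.copy()
    for j in range(k):
        members = X[labels == j]
        noisy_count = len(members) + rng.laplace(0.0, 1.0 / eps_count)
        noisy_sum = members.sum(axis=0) if len(members) else np.zeros(d)
        noisy_sum = noisy_sum + rng.laplace(0.0, d / eps_sum, size=d)
        if noisy_count >= 1.0:
            # Update step: noisy mean, projected back into the data domain.
            new_centroids[j] = np.clip(noisy_sum / noisy_count, -1.0, 1.0)
    return new_centroids, labels
```

Iterating this step is exactly where the noise-induced slow convergence arises that the server-side initialization is designed to mitigate: a poor starting point forces more iterations, each consuming privacy budget and injecting further noise.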