Differentially Private Federated $k$-Means Clustering with Server-Side Data

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of clustering distributed, privacy-sensitive data across edge devices without centralized aggregation, this paper proposes a novel k-means clustering framework integrating differential privacy (DP) and federated learning. The core method introduces a small set of non-distributed, non-sensitive auxiliary data at the server side to initialize the DP-constrained federated clustering process—marking the first such use in DP-Lloyd variants—and thereby mitigates initialization bias and slow convergence caused by DP noise. Theoretically, we establish convergence guarantees and derive an upper bound on cluster identification success probability. Empirically, our approach significantly outperforms existing federated private clustering methods on both synthetic and real-world benchmark datasets, achieving faster convergence, higher clustering quality, and rigorous ε-differential privacy.

Technology Category

Application Category

📝 Abstract
Clustering is a cornerstone of data analysis that is particularly suited to identifying coherent subgroups or substructures in unlabeled data, as are generated continuously in large amounts these days. However, in many cases traditional clustering methods are not applicable, because data are increasingly being produced and stored in a distributed way, e.g. on edge devices, and privacy concerns prevent it from being transferred to a central server. To address this challenge, we present acronym, a new algorithm for $k$-means clustering that is fully-federated as well as differentially private. Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization. Combining our initialization with a simple federated DP-Lloyds algorithm we obtain an algorithm that achieves excellent results on synthetic and real-world benchmark tasks. We also provide a theoretical analysis of our method that provides bounds on the convergence speed and cluster identification success.
Problem

Research questions and friction points this paper is trying to address.

Federated k-means clustering with privacy constraints
Overcoming initialization challenges in private clustering
Handling distributed data without central collection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated k-means with differential privacy
Server-side data for better initialization
Combines DP-Lloyds for federated clustering
🔎 Similar Papers
No similar papers found.
J
Jonathan Scott
Institute of Science and Technology Austria (ISTA)
Christoph H. Lampert
Christoph H. Lampert
Institute of Science and Technology Austria (ISTA)
Trustworthy Machine LearningLifelong Machine LearningTransfer Learning
D
David Saulpic
CNRS & Université Paris Cité, Paris, France