🤖 AI Summary
To address privacy-sensitive clustering of large-scale binary and categorical data under federated learning, this paper proposes a variational inference-based federated Bayesian mixture modeling framework. Each client performs local variational inference and shares only lightweight summaries of its data, never the raw records, thereby ensuring data locality. Model structure discovery is achieved via local "merge and delete" moves within batches and "global merge" moves across batches, recovering a globally consistent clustering from batch-level results. Experiments on simulated data, benchmark datasets, and real-world large-scale electronic health records (EHR) show that the method performs well in comparison to existing clustering algorithms while scaling to large datasets and keeping data local at each node.
📝 Abstract
We present a federated learning approach for Bayesian model-based clustering of large-scale binary and categorical datasets. We introduce a principled 'divide and conquer' inference procedure using variational inference with local merge and delete moves within batches of the data in parallel, followed by 'global' merge moves across batches to find global clustering structures. We show that these merge moves require only summaries of the data in each batch, enabling federated learning across local nodes without requiring the full dataset to be shared. Empirical results on simulated and benchmark datasets demonstrate that our method performs well in comparison to existing clustering algorithms. We validate the practical utility of the method by applying it to large-scale electronic health record (EHR) data.
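To make the key idea concrete, here is a minimal sketch of a "global merge" step driven purely by batch-level summaries. This is an illustrative toy, not the paper's actual variational procedure: the summary format (per-cluster member counts and per-feature sums for binary data) and the mean-distance merge criterion are assumptions chosen for simplicity, whereas the paper derives its merge moves from the variational objective.

```python
import numpy as np

def merge_batch_summaries(summaries, tol=0.1):
    """Toy 'global merge' over per-batch cluster summaries.

    Each summary is (n_k, s_k): the member count and the per-feature sum
    of binary data assigned to one local cluster (hypothetical format).
    Clusters whose estimated Bernoulli means differ by less than `tol`
    in every feature are pooled into one global cluster. Only these
    summary statistics cross the network; raw records never leave a node.
    """
    merged = []  # list of [count, feature_sums] for global clusters
    for n_k, s_k in summaries:
        mean_k = s_k / n_k
        for g in merged:
            g_mean = g[1] / g[0]
            if np.max(np.abs(g_mean - mean_k)) < tol:
                g[0] += n_k          # pool member counts
                g[1] = g[1] + s_k    # pool feature sums
                break
        else:
            merged.append([n_k, s_k.copy()])
    return merged

# Two batches, each reporting two local clusters over 3 binary features;
# corresponding clusters across batches have nearly identical means.
batch1 = [(50, np.array([45., 5., 48.])), (40, np.array([2., 38., 3.]))]
batch2 = [(60, np.array([55., 4., 57.])), (30, np.array([1., 29., 2.]))]
clusters = merge_batch_summaries(batch1 + batch2)
print(len(clusters))  # the four local clusters collapse to two global ones
```

The point of the sketch is the communication pattern: the merge decision needs only `(n_k, s_k)` pairs, which is what makes the 'divide and conquer' procedure federated-friendly.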