Diversity-Driven Learning: Tackling Spurious Correlations and Data Heterogeneity in Federated Models

📅 2025-04-15
🤖 AI Summary
To address poor generalization and slow convergence in federated learning (FL) caused by non-independent and identically distributed (non-IID) client data, class/attribute imbalance, and spurious correlations, this paper proposes FedDiverse, a dynamic client selection algorithm. We further construct the first suite of seven vision benchmark datasets explicitly designed to capture multi-granularity imbalances and spurious correlations. Methodologically, we introduce, for the first time, a systematic six-dimensional metric for quantifying data heterogeneity and establish a novel client selection paradigm grounded in complementary distribution collaboration. Extensive experiments across the seven benchmarks demonstrate that FedDiverse consistently improves the average accuracy of mainstream FL methods by 3.2%, accelerates convergence by 21%, reduces communication and computational overhead, and enhances model robustness against distributional shifts and spurious patterns.

📝 Abstract
Federated Learning (FL) enables decentralized training of machine learning models on distributed data while preserving privacy. However, in real-world FL settings, client data is often non-identically distributed and imbalanced, resulting in statistical data heterogeneity which impacts the generalization capabilities of the server's model across clients, slows convergence and reduces performance. In this paper, we address this challenge by first proposing a characterization of statistical data heterogeneity by means of 6 metrics of global and client attribute imbalance, class imbalance, and spurious correlations. Next, we create and share 7 computer vision datasets for binary and multiclass image classification tasks in Federated Learning that cover a broad range of statistical data heterogeneity and hence simulate real-world situations. Finally, we propose FedDiverse, a novel client selection algorithm in FL which is designed to manage and leverage data heterogeneity across clients by promoting collaboration between clients with complementary data distributions. Experiments on the seven proposed FL datasets demonstrate FedDiverse's effectiveness in enhancing the performance and robustness of a variety of FL methods while having low communication and computational overhead.
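To make the idea of characterizing heterogeneity more concrete, here is a minimal sketch of two generic statistics one could compute on a single client's data: a balance score for class or attribute imbalance, and the mutual information between label and attribute as a proxy for spurious correlation. These are not the six metrics defined in the paper, which this page does not spell out; the function names, the choice of normalized entropy, and the use of mutual information are all assumptions made for illustration.

```python
import math
from collections import Counter


def normalized_entropy(values, num_values):
    """Balance score in [0, 1]: 1.0 means perfectly balanced, 0.0 means a single value.

    Hypothetical helper, not taken from the paper. `num_values` is the number
    of possible classes (or attribute values) in the whole task.
    """
    if num_values < 2 or not values:
        return 0.0
    n = len(values)
    counts = Counter(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_values)


def mutual_information(labels, attributes):
    """Mutual information (in nats) between labels and a non-causal attribute.

    A high value means the attribute predicts the label on this client,
    which is one possible signal of a spurious correlation (an assumption
    of this sketch, not the paper's definition).
    """
    n = len(labels)
    joint = Counter(zip(labels, attributes))
    label_counts = Counter(labels)
    attr_counts = Counter(attributes)
    return sum(
        (c / n) * math.log((c * n) / (label_counts[y] * attr_counts[a]))
        for (y, a), c in joint.items()
    )


# Toy client: two classes, and a binary attribute such as background color.
labels = [0, 0, 1, 1]
correlated_attr = [0, 0, 1, 1]    # attribute perfectly tracks the label
independent_attr = [0, 1, 0, 1]   # attribute carries no label information

balance = normalized_entropy(labels, num_values=2)
mi_correlated = mutual_information(labels, correlated_attr)
mi_independent = mutual_information(labels, independent_attr)
```

On this toy client the classes are balanced, the correlated attribute reaches the maximum mutual information for a binary label (log 2 nats), and the independent attribute scores zero.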
Problem

Research questions and friction points this paper is trying to address.

Addressing statistical data heterogeneity in Federated Learning
Mitigating spurious correlations and class imbalance in FL
Improving model generalization and convergence in decentralized training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes six metrics for statistical data heterogeneity
Introduces seven diverse FL datasets for real-world simulation
Develops FedDiverse algorithm for client selection optimization
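The notion of selecting clients with complementary data distributions can be illustrated with a short sketch. The code below is a hypothetical greedy heuristic, not the FedDiverse algorithm itself: it assumes the server knows each client's per-class sample counts (an assumption that a privacy-preserving method may well avoid) and repeatedly adds the client whose data brings the pooled class distribution closest to uniform. The function name, the input format, and the distance measure are all illustrative choices.

```python
def greedy_complementary_selection(client_counts, num_select):
    """Pick clients whose pooled class distribution is as balanced as possible.

    client_counts: dict mapping a client id to a list of per-class sample counts.
    Hypothetical heuristic for illustration only; not the paper's method.
    """
    num_classes = len(next(iter(client_counts.values())))

    def distance_to_uniform(counts):
        # Total variation distance between the class distribution and uniform.
        total = sum(counts)
        if total == 0:
            return 1.0
        return 0.5 * sum(abs(c / total - 1 / num_classes) for c in counts)

    pooled = [0] * num_classes
    remaining = dict(client_counts)
    selected = []
    for _ in range(min(num_select, len(remaining))):
        # Sorting the ids makes tie-breaking deterministic.
        best = min(
            sorted(remaining),
            key=lambda cid: distance_to_uniform(
                [p + c for p, c in zip(pooled, remaining[cid])]
            ),
        )
        pooled = [p + c for p, c in zip(pooled, remaining.pop(best))]
        selected.append(best)
    return selected


# Three toy clients on a binary task: "a" and "c" are skewed toward class 0,
# "b" is skewed toward class 1.
clients = {"a": [90, 10], "b": [10, 90], "c": [80, 20]}
chosen = greedy_complementary_selection(clients, num_select=2)
```

In this toy run the least skewed client, "c", is picked first, and "b" is picked second because it offsets the class-0 skew of "c" far better than "a" would.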
Gergely D. Németh
ELLIS Alicante
Eros Fani
Polytechnic Institute of Turin, Basque Center for Applied Mathematics
Yeat Jeng Ng
University of Sussex
Barbara Caputo
DAUIN, Politecnico di Torino
Artificial Intelligence, Computer Vision, Intelligent Systems, Multi Modal Learning
Miguel Ángel Lozano
University of Alicante
Nuria Oliver
ELLIS Alicante
Novi Quadrianto
Professor of Machine Learning, University of Sussex UK, BCAM Spain, Monash Indonesia
Trustworthy Machine Learning