Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current non-IID modeling in federated learning (FL) relies heavily on label distribution skew and fails to capture the genuine data heterogeneity prevalent in computer vision tasks, leading to systematic overestimation of algorithmic performance. To address this, we propose a novel paradigm: *embedding-driven data heterogeneity*. Leveraging features extracted from pretrained models, we characterize task-specific distributional divergence via clustering in the embedding space and introduce a Dirichlet-guided partitioning scheme over embedding clusters to generate more realistic non-IID benchmarks. Extensive evaluation across multiple vision tasks demonstrates significant performance degradation of mainstream FL algorithms under this embedding-based heterogeneity. Furthermore, we propose new evaluation metrics and publicly release the benchmark datasets and implementation code to advance FL research grounded in realistic data heterogeneity.

📝 Abstract
Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This shift from traditional centralized ML introduces challenges due to the non-IID (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature predominantly models data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
Problem

Research questions and friction points this paper is trying to address.

Redefining non-IID data in Federated Learning for computer vision tasks.
Addressing limitations of label distribution skew in capturing data heterogeneity.
Introducing embedding-based data heterogeneity for task-specific distributions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes pre-trained deep neural networks for embeddings
Introduces embedding-based data heterogeneity concept
Clusters data points by embedding similarity, then distributes clusters among clients via the Dirichlet distribution
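The partitioning step described above can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes `cluster_ids` has already been produced by clustering pretrained-model embeddings (e.g., with k-means), and the function name and interface are hypothetical. Each embedding cluster is split among clients according to a Dirichlet draw, so a smaller `alpha` yields a more skewed, more heterogeneous partition.

```python
import numpy as np

def dirichlet_partition(cluster_ids, num_clients, alpha, seed=0):
    """Split sample indices among clients, cluster by cluster.

    For each embedding cluster, draw client proportions from
    Dirichlet(alpha) and assign that cluster's samples accordingly.
    Smaller alpha -> more skewed (more non-IID) client datasets.
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        rng.shuffle(idx)
        # Fraction of this cluster that each client receives
        props = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert cumulative proportions into split points
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy usage: 100 samples in two embedding clusters, 5 clients
cluster_ids = np.array([0] * 50 + [1] * 50)
parts = dirichlet_partition(cluster_ids, num_clients=5, alpha=0.5)
```

With label-skew partitioning the same machinery would act on class labels instead of `cluster_ids`; swapping in embedding clusters is what lets the scheme model heterogeneity for vision tasks that have no classification labels at all.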