🤖 AI Summary
To address two key challenges in federated fine-tuning of foundation models under heterogeneous edge environments—incompatibility of client-side LoRA configurations, and the slow convergence and poor generalization induced by non-IID data—this paper proposes FFT-MoE. Methodologically, it replaces LoRA adapters with a sparse Mixture-of-Experts (MoE) architecture, integrating a lightweight gating network and heterogeneity-aware routing regularization to preserve client personalization while ensuring model aggregability. Additionally, an auxiliary load-balancing loss is introduced to dynamically coordinate expert assignment, mitigating the coupled effects of structural heterogeneity and data skew. Experimental results across diverse IID and non-IID settings demonstrate that FFT-MoE achieves significantly faster convergence (1.8× speedup on average) and improved generalization (+3.2% accuracy), while maintaining high communication efficiency and adaptability to heterogeneous device resources.
📝 Abstract
As foundation models (FMs) drive progress toward Artificial General Intelligence (AGI), fine-tuning them under privacy and resource constraints has become increasingly critical, particularly when high-quality training data resides on distributed edge devices. Federated Learning (FL) offers a compelling solution through Federated Fine-Tuning (FFT), which enables collaborative model adaptation without sharing raw data. Recent approaches incorporate Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) to reduce computational overhead. However, LoRA-based FFT faces two major limitations in heterogeneous FL environments: structural incompatibility across clients with varying LoRA configurations, and limited adaptability to non-IID data distributions, which hinders convergence and generalization. To address these challenges, we propose FFT-MoE, a novel FFT framework that replaces LoRA with sparse Mixture-of-Experts (MoE) adapters. Each client trains a lightweight gating network to selectively activate a personalized subset of experts, enabling fine-grained adaptation to local resource budgets while preserving aggregation compatibility. To further combat the expert load imbalance caused by device and data heterogeneity, we introduce a heterogeneity-aware auxiliary loss that dynamically regularizes the routing distribution to ensure expert diversity and balanced utilization. Extensive experiments spanning both IID and non-IID conditions demonstrate that FFT-MoE consistently outperforms state-of-the-art FFT baselines in generalization performance and training efficiency.
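To make the two mechanisms in the abstract concrete, here is a minimal pure-Python sketch of (a) sparse top-k gating, where only a small subset of experts is activated per input, and (b) a standard load-balancing auxiliary loss on the routing distribution. The function names and the exact loss form (fraction-routed × mean-gate-probability, as popularized by Switch-Transformer-style MoE) are assumptions for illustration; the paper's heterogeneity-aware variant is not specified in the abstract and may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(gate_logits, k):
    """Sparse MoE routing: keep the k highest-probability experts.

    Returns (selected expert indices, weights renormalized over the
    selection). Illustrative only; the paper's gating network is a
    learned module whose details the abstract does not give.
    """
    probs = softmax(gate_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:k]
    z = sum(probs[i] for i in chosen)
    weights = [probs[i] / z for i in chosen]
    return chosen, weights

def load_balance_loss(all_probs, all_chosen, num_experts):
    """Common auxiliary loss: num_experts * sum_e f_e * p_e, where f_e is
    the fraction of tokens routed to expert e and p_e is the mean gate
    probability assigned to expert e. It is minimized (value 1.0) when
    routing is perfectly uniform, penalizing expert load imbalance.
    (Assumed form; the paper's heterogeneity-aware loss may differ.)
    """
    n = len(all_probs)
    f = [0.0] * num_experts  # routed-token fraction per expert
    p = [0.0] * num_experts  # mean gate probability per expert
    for probs, chosen in zip(all_probs, all_chosen):
        for e in chosen:
            f[e] += 1.0 / (n * len(chosen))
        for e in range(num_experts):
            p[e] += probs[e] / n
    return num_experts * sum(fe * pe for fe, pe in zip(f, p))
```

In a federated setting, each client would train its own gating network over a shared pool of expert adapters, so the expert weights remain aggregable across clients even when each client activates a different subset.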