🤖 AI Summary
To address the challenges of cold-start adaptation, privacy sensitivity, and non-independent-and-identically-distributed (Non-IID) data in federated learning (FL), this paper proposes FedJETs, a Mixture-of-Experts (MoE) framework embedded in an FL setup. Rather than relying on explicit task partitioning, FedJETs jointly trains distributed expert modules and a dynamic gating function that routes each input to the most relevant expert(s): client diversity is leveraged to train experts specialized on different subsets of classes, while a pretrained common expert sharpens the gating function's routing decisions on the fly. The result is just-in-time personalization that requires no per-client fine-tuning, making the method suitable for fresh or privacy-constrained clients. Evaluated on standard FL benchmarks, FedJETs improves accuracy by up to 18% over state-of-the-art baselines while maintaining competitive zero-shot performance, handling non-homogeneous data distributions, and scaling efficiently with the number of clients.
📄 Abstract
One of the goals in Federated Learning (FL) is to create personalized models that can adapt to the context of each participating client, while utilizing knowledge from a shared global model. Yet personalization often requires a fine-tuning step using clients' labeled data in order to achieve good performance. This may not be feasible in scenarios where incoming clients are fresh and/or have privacy concerns. It remains an open question, then, how to achieve just-in-time personalization in these scenarios. We propose FedJETs, a novel solution that uses a Mixture-of-Experts (MoE) framework within an FL setup. Our method leverages the diversity of the clients to train specialized experts on different subsets of classes, and a gating function to route the input to the most relevant expert(s). Our gating function harnesses the knowledge of a pretrained model (the common expert) to enhance its routing decisions on-the-fly. As a highlight, our approach can improve accuracy up to 18% in state-of-the-art FL settings, while maintaining competitive zero-shot performance. In practice, our method can handle non-homogeneous data distributions, scale more efficiently, and improve state-of-the-art performance on common FL benchmarks.
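
To make the routing mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a frozen pretrained common expert produces features, a small gating network scores the class-specialized experts, and each input's output is a weighted combination of its top-k experts. This is not the authors' implementation; all module names, dimensions, and the dense top-k weighting are illustrative assumptions.

```python
# Minimal sketch of the FedJETs-style routing idea (NOT the authors' code):
# a frozen pretrained "common expert" produces features that a small gating
# network uses to weight the top-k specialized experts for each input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithCommonExpert(nn.Module):
    def __init__(self, common_expert: nn.Module, experts: list[nn.Module],
                 feat_dim: int, top_k: int = 2):
        super().__init__()
        self.common_expert = common_expert  # pretrained; frozen during training
        for p in self.common_expert.parameters():
            p.requires_grad = False
        self.experts = nn.ModuleList(experts)          # class-specialized experts
        self.gate = nn.Linear(feat_dim, len(experts))  # routing score per expert
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Features from the common expert guide the routing decision.
        feats = self.common_expert(x)                 # (batch, feat_dim)
        scores = F.softmax(self.gate(feats), dim=-1)  # (batch, num_experts)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)
        # Keep only the top-k weights; non-selected experts get zero weight.
        mask = torch.zeros_like(scores).scatter(1, topk_idx, topk_w)
        # For readability, every expert runs on the full batch here; a real
        # system would dispatch each sample only to its selected experts.
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)
        return (mask.unsqueeze(-1) * expert_outs).sum(dim=1)

# Toy usage (hypothetical sizes): 4 experts, shared features, 2-way routing.
common = nn.Sequential(nn.Flatten(), nn.Linear(32, 16), nn.ReLU())
experts = [nn.Sequential(nn.Flatten(), nn.Linear(32, 10)) for _ in range(4)]
moe = MoEWithCommonExpert(common, experts, feat_dim=16, top_k=2)
logits = moe(torch.randn(8, 32))  # -> shape (8, 10)
```

In a deployed FL system the experts would live on, and be trained by, different clients, with only the selected experts' outputs computed and communicated; the dense per-expert computation above is purely for clarity of the routing logic.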