FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

📅 2025-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MoE training systems face scalability bottlenecks, including computational load imbalance, high inter-node communication overhead, and inefficient gradient aggregation. This paper proposes FSMoE, a flexible and efficient training system tailored for sparse Mixture-of-Experts (MoE) models. The approach addresses these challenges through three core innovations: (1) a unified module abstraction coupled with online performance profiling to enable task scheduling across different MoE implementations; (2) co-scheduling of intra-node and inter-node communications with computations to minimize communication overhead; and (3) an adaptive gradient partitioning method for gradient aggregation, together with a schedule that adaptively pipelines communications and computations. The system supports four popular routing functions and is up to 1.42× faster than existing implementations. It further outperforms the state-of-the-art MoE training systems DeepSpeed-MoE and Tutel by 1.18–1.22× on 1458 configured MoE layers and by 1.19–3.01× on real-world MoE models based on GPT-2 and Mixtral.
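The adaptive gradient pipelining idea above can be illustrated with a minimal sketch: split the gradients into chunks so that the communication (e.g. an all-reduce) of chunk i overlaps with the backward computation of chunk i+1. This is a hypothetical toy model, not FSMoE's implementation; `compute_grad_chunk` and `all_reduce` are stand-ins for real backward kernels and collective operations.

```python
# Illustrative sketch (not FSMoE's code): overlap gradient "communication"
# with backward "computation" by chunking gradients, so the all-reduce of
# chunk i runs while chunk i+1 is being computed.
from concurrent.futures import ThreadPoolExecutor

def compute_grad_chunk(i):
    # Stand-in for backward computation of one parameter chunk.
    return [i] * 4

def all_reduce(chunk):
    # Stand-in for an inter-node gradient all-reduce.
    return [2 * g for g in chunk]

def pipelined_backward(num_chunks):
    reduced = [None] * num_chunks
    with ThreadPoolExecutor(max_workers=1) as comm:  # one "comm stream"
        pending = None
        for i in range(num_chunks):
            grad = compute_grad_chunk(i)           # compute chunk i ...
            if pending is not None:                # ... while chunk i-1's
                reduced[i - 1] = pending.result()  # all-reduce completes
            pending = comm.submit(all_reduce, grad)
        if pending is not None:
            reduced[num_chunks - 1] = pending.result()
    return reduced
```

In a real system the chunk size would be chosen adaptively from profiled computation and communication costs, which is the role of FSMoE's online profiling.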

📝 Abstract
Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42× speedup), and 2) FSMoE outperforms the state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18×-1.22× on 1458 MoE layers and 1.19×-3.01× on real-world MoE models based on GPT-2 and Mixtral using a popular routing function.
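The token-routing module the abstract mentions is typically a learned top-k gating function: each token's gate logits over the experts are softmaxed, the k highest-scoring experts are selected, and their weights are renormalized. A minimal sketch for a single token (illustrative only; `topk_route` is a hypothetical helper, not part of FSMoE):

```python
# Minimal top-k token routing sketch for one token in an MoE layer
# (hypothetical helper, not FSMoE's actual routing implementation).
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def topk_route(gate_logits, k=2):
    """Return the k experts with the highest gate probability for this
    token, with their combination weights renormalized to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    total = sum(probs[e] for e in top)
    return [(e, probs[e] / total) for e in top]
```

For example, with gate logits `[0.1, 2.0, -1.0, 1.5]` over four experts and k=2, the token is dispatched to experts 1 and 3. Batching this decision over all tokens determines the token-communication pattern (which tokens go to which expert-hosting devices), which is why routing and communication are co-scheduled.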
Problem

Research questions and friction points this paper is trying to address.

Scalability
Efficiency
Large Language Model Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

FSMoE
MoE_Model_Training
GPU_Performance_Enhancement