Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address stragglers caused by dynamic expert load imbalance during Mixture-of-Experts (MoE) training, as well as the memory inefficiency and rearrangement latency of existing systems, this paper proposes Fully Sharded Sparse Data Parallelism (FSSDP), a novel parallel training paradigm. FSSDP fully shards both MoE parameters and optimizer states across devices and, in each training step, sparsely materializes only the parameter subsets that are actually needed. It combines heterogeneous sharding, fine-grained partitioning of MoE layer state, and sparse AllGather/ReduceScatter collectives. Experiments show that the resulting system, Hecate, delivers up to 3.54× speedup over state-of-the-art MoE training systems and achieves consistent gains across diverse MoE architectures and hardware platforms.

📝 Abstract
Mixture-of-Experts (MoE) has emerged as a promising sparse paradigm for scaling up pre-trained models (PTMs) with remarkable cost-effectiveness. However, the dynamic nature of MoE leads to rapid fluctuations and imbalances in expert loads during training, resulting in significant straggler effects that hinder training performance when using expert parallelism (EP). Existing MoE training systems attempt to mitigate these effects through expert rearrangement strategies, but they face challenges in terms of memory efficiency and timeliness of rearrangement. This paper proposes Fully Sharded Sparse Data Parallelism (FSSDP), an innovative approach that tackles the parallelization of MoE layers and potential straggler effects caused by imbalanced expert loads from a new perspective. FSSDP fully shards the parameters and optimizer states of MoE layers across devices and sparsely materializes MoE parameters from scratch in each iteration with two sparse collectives, SparseAllGather and SparseReduceScatter. We build Hecate, a high-performance MoE training system that incorporates FSSDP to fully unlock its potential. Hecate introduces heterogeneous sharding, sparse materialization, and re-materialization techniques to construct flexible and efficient expert placements with low memory and communication overhead. Our evaluation reveals that Hecate achieves up to 3.54x speedup over state-of-the-art MoE training systems and consistently demonstrates improvements across model architectures and hardware environments.
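The abstract's core mechanism is sparse materialization from fully sharded state: every device permanently holds a shard of every expert, and each step gathers full weights only for the experts the router actually selected. A minimal single-process sketch of that idea follows; all names, shapes, and the list-based "devices" are hypothetical stand-ins (the paper's SparseAllGather/SparseReduceScatter are real cross-device collectives, not shown here).

```python
# Single-process sketch of FSSDP-style sparse materialization.
# Hypothetical model: expert e's full weight vector is the concatenation
# of per-device shards; only routed experts are ever fully materialized.
from typing import Dict, List

NUM_DEVICES = 4
NUM_EXPERTS = 8
SHARD_LEN = 2  # elements of each expert held per device

# Fully sharded state: device d owns shard d of every expert's weights.
shards: List[Dict[int, List[float]]] = [
    {e: [e * 10.0 + d] * SHARD_LEN for e in range(NUM_EXPERTS)}
    for d in range(NUM_DEVICES)
]

def sparse_all_gather(active_experts: List[int]) -> Dict[int, List[float]]:
    """Materialize full weights only for the experts routed this step
    by concatenating each expert's shards from all devices."""
    full: Dict[int, List[float]] = {}
    for e in active_experts:
        vec: List[float] = []
        for d in range(NUM_DEVICES):
            vec.extend(shards[d][e])
        full[e] = vec
    return full

def sparse_reduce_scatter(grads: Dict[int, List[float]]) -> None:
    """Scatter each active expert's gradient back to its owning shards,
    accumulating into the persistent sharded state."""
    for e, g in grads.items():
        for d in range(NUM_DEVICES):
            piece = g[d * SHARD_LEN:(d + 1) * SHARD_LEN]
            shards[d][e] = [w + p for w, p in zip(shards[d][e], piece)]

# One step touches only the routed experts (imbalanced load: 2 of 8).
active = [1, 5]
weights = sparse_all_gather(active)   # full copies exist only transiently
grads = {e: [0.1] * (SHARD_LEN * NUM_DEVICES) for e in active}
sparse_reduce_scatter(grads)          # shards persist; full copies are freed
```

The point of the sketch is the memory argument: no device ever holds full copies of inactive experts, so peak memory scales with the experts routed per step rather than with the total expert count.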
Problem

Research questions and friction points this paper is trying to address.

Addresses MoE training inefficiencies
Reduces straggler effects in expert parallelism
Enhances memory and communication efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully Sharded Sparse Data Parallelism
SparseAllGather and SparseReduceScatter
Heterogeneous sharding and sparse materialization
Yuhao Qing
Dr. Cui Heming, The University of Hong Kong
machine learning systems · AI systems · GPU

Guichao Zhu
The University of Hong Kong, Hong Kong SAR, China

Fanxin Li
University of Hong Kong

Lintian Lei
The University of Hong Kong, Hong Kong SAR, China

Zekai Sun
The University of Hong Kong, Hong Kong SAR, China

Xiuxian Guan
The University of Hong Kong, Hong Kong SAR, China

Shixiong Zhao
University of Hong Kong
Distributed Systems

Xusheng Chen
Huawei Cloud
Distributed Systems · Cloud Computing · Distributed Databases

Dong Huang
The University of Hong Kong, Hong Kong SAR, China

Sen Wang
Huawei Technologies, China

Heming Cui
University of Hong Kong
Operating Systems · Programming Languages · Distributed Systems · Security