MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training

πŸ“… 2024-08-08
πŸ›οΈ Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the high checkpointing overhead and low fault tolerance efficiency in distributed training of ultra-large-scale sparse Mixture-of-Experts (MoE) models, this paper proposes MoC-System, a hybrid checkpointing system. Methodologically, it introduces (1) Partial Experts Checkpointing (PEC), a novel mechanism that saves only a selected subset of experts per checkpoint, an algorithm-system co-optimization that shrinks MoE checkpoints to sizes comparable with dense models; and (2) a two-level asynchronous checkpointing manager that decouples in-memory snapshotting from persistent storage, combined with fully sharded checkpointing under ZeRO-2 data parallelism and expert parallelism. Evaluated within the Megatron-DeepSpeed framework, MoC-System reduces per-checkpoint overhead by up to 98.9% while maintaining comparable model accuracy, even improving average downstream-task accuracy by 1.08%. This work constitutes the first systematic solution to efficient fault tolerance for highly scalable MoE model training.
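The core idea of PEC summarized above is that each checkpoint writes all dense (non-expert) parameters but only a rotating subset of experts, so consecutive checkpoints together cover the full expert set. A minimal sketch of that selection policy, assuming illustrative names and a simple round-robin rotation (the paper's exact selection strategy may differ):

```python
import copy

def partial_experts_checkpoint(model_state, expert_of, step, num_experts, experts_per_ckpt):
    """Hedged sketch of Partial Experts Checkpointing (PEC).

    model_state:      dict of parameter name -> tensor-like value
    expert_of:        dict mapping expert-parameter names to an expert id
                      (dense parameters are absent from this map)
    step:             checkpoint index, used to rotate the saved subset
    All names and the rotation policy are illustrative assumptions.
    """
    # Rotate which experts are saved so successive checkpoints cover them all.
    start = (step * experts_per_ckpt) % num_experts
    selected = {(start + i) % num_experts for i in range(experts_per_ckpt)}

    ckpt = {}
    for name, tensor in model_state.items():
        expert_id = expert_of.get(name)  # None => dense parameter, always saved
        if expert_id is None or expert_id in selected:
            ckpt[name] = copy.deepcopy(tensor)
    return ckpt, selected
```

With 4 experts and one expert saved per checkpoint, checkpoint 0 saves expert 0 plus all dense weights, checkpoint 1 saves expert 1, and so on, which is how the checkpoint size stays close to a dense model's.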

πŸ“ Abstract
As large language models continue to scale up, distributed training systems have expanded beyond 10k nodes, intensifying the importance of fault tolerance. Checkpointing has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges due to the substantial increase in model size, despite comparable computational demands to dense models. In this work, we propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems. MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts, effectively reducing the MoE checkpoint size to levels comparable with dense models. Incorporating hybrid parallel strategies, MoC-System involves fully sharded checkpointing strategies to evenly distribute the workload across distributed ranks. Furthermore, MoC-System introduces a two-level checkpointing management method that asynchronously handles in-memory snapshots and persistence processes. We build MoC-System upon the Megatron-DeepSpeed framework, achieving up to a 98.9% reduction in overhead for each checkpointing process compared to the original method, during MoE model training with ZeRO-2 data parallelism and expert parallelism. Additionally, extensive empirical analyses substantiate that our methods enhance efficiency while maintaining comparable model accuracy, even achieving an average accuracy increase of 1.08% on downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Efficient fault tolerance for sparse Mixture-of-Experts training
Reducing checkpoint size in distributed MoE model systems
Optimizing checkpoint overhead while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partial Experts Checkpointing reduces MoE checkpoint size
Hybrid parallel strategies distribute workload evenly
Two-level checkpointing manages snapshots and persistence
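The two-level management idea in the last bullet can be sketched as follows: level one takes a fast in-memory snapshot on the training thread, and level two persists that snapshot on a background thread so training does not block on storage I/O. Class and method names below are illustrative assumptions, not the paper's implementation:

```python
import copy
import threading

class TwoLevelCheckpointManager:
    """Hedged sketch of two-level asynchronous checkpointing:
    level 1 snapshots model state in memory (cheap, blocking);
    level 2 persists the snapshot asynchronously off the critical path."""

    def __init__(self, persist_fn):
        self._persist_fn = persist_fn  # e.g. writes a checkpoint shard to disk
        self._worker = None

    def checkpoint(self, model_state, step):
        # Level 1: copy parameters into a private in-memory snapshot, so
        # training may keep mutating model_state immediately afterwards.
        snapshot = copy.deepcopy(model_state)
        # Allow only one in-flight persistence job at a time.
        if self._worker is not None:
            self._worker.join()
        # Level 2: persist the snapshot on a background thread.
        self._worker = threading.Thread(
            target=self._persist_fn, args=(snapshot, step)
        )
        self._worker.start()

    def finalize(self):
        # Ensure the last persistence job completes (e.g. at shutdown).
        if self._worker is not None:
            self._worker.join()
```

Because the snapshot is decoupled from persistence, weight updates performed after `checkpoint()` returns cannot corrupt the data being written out, which is the consistency property the two-level design relies on.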
πŸ”Ž Similar Papers
No similar papers found.