MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training

πŸ“… 2024-08-08
πŸ›οΈ Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the high checkpointing overhead and low fault tolerance efficiency in distributed training of ultra-large-scale sparse Mixture-of-Experts (MoE) models, this paper proposes MoC-System, a hybrid checkpointing system. Methodologically, it introduces (1) Partial Experts Checkpointing (PEC), a novel mechanism that saves only a selected subset of experts per checkpoint, an algorithm-system co-optimization that shrinks MoE checkpoints to sizes comparable with dense models; and (2) a two-level asynchronous checkpointing manager that decouples in-memory snapshotting from persistent storage, combined with fully sharded checkpointing under ZeRO-2 data parallelism and expert parallelism. Evaluated within the Megatron-DeepSpeed framework, MoC-System reduces per-checkpoint overhead by up to 98.9% while maintaining comparable model accuracy, even improving average downstream-task accuracy by 1.08%. This work constitutes the first systematic solution to efficient fault tolerance for highly scalable MoE model training.
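The core idea of PEC summarized above is that each checkpoint writes all dense (non-expert) parameters but only a rotating subset of experts, so consecutive checkpoints together cover the full expert set. A minimal sketch of that selection policy, assuming illustrative names and a simple round-robin rotation (the paper's exact selection strategy may differ):

```python
import copy

def partial_experts_checkpoint(model_state, expert_of, step, num_experts, experts_per_ckpt):
    """Hedged sketch of Partial Experts Checkpointing (PEC).

    model_state:      dict of parameter name -> tensor-like value
    expert_of:        dict mapping expert-parameter names to an expert id
                      (dense parameters are absent from this map)
    step:             checkpoint index, used to rotate the saved subset
    All names and the rotation policy are illustrative assumptions.
    """
    # Rotate which experts are saved so successive checkpoints cover them all.
    start = (step * experts_per_ckpt) % num_experts
    selected = {(start + i) % num_experts for i in range(experts_per_ckpt)}

    ckpt = {}
    for name, tensor in model_state.items():
        expert_id = expert_of.get(name)  # None => dense parameter, always saved
        if expert_id is None or expert_id in selected:
            ckpt[name] = copy.deepcopy(tensor)
    return ckpt, selected
```

With 4 experts and one expert saved per checkpoint, checkpoint 0 saves expert 0 plus all dense weights, checkpoint 1 saves expert 1, and so on, which is how the checkpoint size stays close to a dense model's.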

πŸ“ Abstract
As large language models continue to scale up, distributed training systems have expanded beyond 10k nodes, intensifying the importance of fault tolerance. Checkpointing has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges due to the substantial increase in model size, despite comparable computational demands to dense models. In this work, we propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems. MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts, effectively reducing the MoE checkpoint size to levels comparable with dense models. Incorporating hybrid parallel strategies, MoC-System involves fully sharded checkpointing strategies to evenly distribute the workload across distributed ranks. Furthermore, MoC-System introduces a two-level checkpointing management method that asynchronously handles in-memory snapshots and persistence processes. We build MoC-System upon the Megatron-DeepSpeed framework, achieving up to a 98.9% reduction in overhead for each checkpointing process compared to the original method, during MoE model training with ZeRO-2 data parallelism and expert parallelism. Additionally, extensive empirical analyses substantiate that our methods enhance efficiency while maintaining comparable model accuracy, even achieving an average accuracy increase of 1.08% on downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Efficient fault tolerance for sparse Mixture-of-Experts training
Reducing checkpoint size in distributed MoE model systems
Optimizing checkpoint overhead while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partial Experts Checkpointing reduces MoE checkpoint size
Hybrid parallel strategies distribute workload evenly
Two-level checkpointing manages snapshots and persistence
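The two-level management idea in the last bullet can be sketched as follows: level one takes a fast in-memory snapshot on the training thread, and level two persists that snapshot on a background thread so training does not block on storage I/O. Class and method names below are illustrative assumptions, not the paper's implementation:

```python
import copy
import threading

class TwoLevelCheckpointManager:
    """Hedged sketch of two-level asynchronous checkpointing:
    level 1 snapshots model state in memory (cheap, blocking);
    level 2 persists the snapshot asynchronously off the critical path."""

    def __init__(self, persist_fn):
        self._persist_fn = persist_fn  # e.g. writes a checkpoint shard to disk
        self._worker = None

    def checkpoint(self, model_state, step):
        # Level 1: copy parameters into a private in-memory snapshot, so
        # training may keep mutating model_state immediately afterwards.
        snapshot = copy.deepcopy(model_state)
        # Allow only one in-flight persistence job at a time.
        if self._worker is not None:
            self._worker.join()
        # Level 2: persist the snapshot on a background thread.
        self._worker = threading.Thread(
            target=self._persist_fn, args=(snapshot, step)
        )
        self._worker.start()

    def finalize(self):
        # Ensure the last persistence job completes (e.g. at shutdown).
        if self._worker is not None:
            self._worker.join()
```

Because the snapshot is decoupled from persistence, weight updates performed after `checkpoint()` returns cannot corrupt the data being written out, which is the consistency property the two-level design relies on.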
πŸ”Ž Similar Papers
No similar papers found.