Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement

📅 2024-07-05
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Training sparsely-activated Mixture-of-Experts (MoE) models on preemptible cloud instances suffers from high restart overhead and GPU underutilization caused by frequent node failures. Method: the paper proposes the first fault-tolerant, elastic training framework tailored to expert parallelism. Its core innovations are: (1) a provably optimal adaptive expert replica allocation and placement algorithm that jointly optimizes load balancing and the probability of recovery upon failure; and (2) an elastic token dispatcher coupled with a failure-aware dynamic reconfiguration mechanism that re-integrates all surviving nodes within seconds. Results: under frequent node failures, throughput reaches up to 5.7× that of state-of-the-art systems; on a real AWS Spot instance trace the improvement is 3.4×, with no GPU left idle and seamless recovery.
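The adaptive replica allocation can be illustrated with a simple greedy heuristic: give each extra replica slot to the expert whose per-replica load is currently heaviest. This is a hedged sketch of the general idea, not the paper's actual (provably optimal) algorithm; the function name and load model are assumptions for illustration.

```python
from heapq import heapify, heapreplace

def allocate_replicas(expert_loads, total_slots):
    """Greedily assign replica slots so that each expert's
    per-replica load (load / replicas) stays balanced.
    Every expert gets at least one replica."""
    n = len(expert_loads)
    assert total_slots >= n, "need at least one slot per expert"
    replicas = [1] * n
    # Max-heap keyed on per-replica load (negated for heapq's min-heap).
    heap = [(-load, i) for i, load in enumerate(expert_loads)]
    heapify(heap)
    for _ in range(total_slots - n):
        # Hand the next replica to the currently heaviest expert.
        _, i = heap[0]
        replicas[i] += 1
        heapreplace(heap, (-expert_loads[i] / replicas[i], i))
    return replicas
```

For example, with loads `[4, 1, 1]` and 6 GPU slots, the heavy expert absorbs all three extra replicas, equalizing per-replica load at 1.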

📝 Abstract
Sparsely-activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs) due to its sub-linear scaling for computation costs. However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant, as all GPUs need to wait idle until the failure is resolved, potentially losing considerable training progress as training has to restart from checkpoints. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture. We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speeds up training, while a provably optimal expert placement algorithm is developed to maximize the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and 3.4x on a real spot instance trace.
Problem

Research questions and friction points this paper is trying to address.

Addressing frequent failures in MoE model training
Handling expert workload imbalance with adaptive replica allocation
Optimizing expert placement for failure recovery and GPU utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptively allocates expert replicas for workload balance
Uses optimal expert placement to maximize failure recovery
Employs flexible token dispatcher to utilize all nodes
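The flexible token dispatcher in the last bullet can be sketched as routing each token to a surviving replica of its target expert, skipping failed nodes. This is an illustrative assumption of how such a dispatcher might work, not Lazarus's actual implementation; all names here are hypothetical.

```python
from itertools import cycle

def build_dispatch(placement, live_nodes):
    """Map each expert to a round-robin iterator over its replicas
    that survive on live nodes. `placement` maps expert id ->
    list of node ids holding a replica of that expert."""
    routes = {}
    for expert, nodes in placement.items():
        survivors = [n for n in nodes if n in live_nodes]
        if not survivors:
            raise RuntimeError(f"expert {expert} lost all replicas")
        routes[expert] = cycle(survivors)
    return routes

def dispatch(token_experts, routes):
    """Assign each token's target expert to a surviving replica node."""
    return [next(routes[e]) for e in token_experts]
```

After a node failure, rebuilding the routes against the new set of live nodes lets training continue on all surviving GPUs; if an expert loses every replica, recovery from a checkpoint would be needed instead.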