🤖 AI Summary
Large-scale LLM pretraining on 10⁵–10⁶ accelerators suffers from frequent failures, demanding an elastic fault-tolerance mechanism that jointly ensures parameter consistency, low recovery latency, high throughput after rescaling, and computational consistency. This paper proposes the first distributed training system to satisfy all four constraints. It integrates multi-dimensional scheduling (computation graph, dataflow, frequency, and random-number generation), online pipelined resharding, asynchronous parameter migration, DVFS-based power-frequency control, ZeRO partition interleaving, and dynamic communication-group construction. Evaluated on a 96-NPU cluster, the system achieves 1.35× and 1.60× the training throughput of ReCycle and TorchFT, respectively; reconstructs communication groups in under one second; reduces mean time to recover (MTTR) for parameter migration by up to 51%; and cuts convergence deviation by roughly 78%.
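To make the dataflow-rescheduling idea concrete, here is a minimal sketch (hypothetical, not ElasWave's actual API) of redistributing micro-batches across surviving data-parallel ranks after a failure, so that the global batch size, and therefore the gradient scale, stays unchanged; only the function name and parameters below are our own illustration.

```python
# Hypothetical sketch: reassign a fixed pool of micro-batches to fewer ranks
# after a failure, preserving the global batch size and gradient scale.

def reshard_micro_batches(global_batch: int, micro_batch: int, ranks: int) -> list[int]:
    """Assign micro-batch counts to `ranks` workers, preserving global_batch."""
    assert global_batch % micro_batch == 0, "global batch must divide into micro-batches"
    num_micro = global_batch // micro_batch
    base, extra = divmod(num_micro, ranks)
    # Earlier ranks absorb the remainder: per-rank accumulation steps differ,
    # but the total token count summed over ranks is unchanged.
    return [base + (1 if r < extra else 0) for r in range(ranks)]

# Example: 512-sample global batch, micro-batch 4 => 128 micro-batches.
# After a failure shrinks 32 ranks to 31, each rank runs 4 or 5 micro-batches.
print(reshard_micro_batches(512, 4, 31))
```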
📝 Abstract
Large-scale LLM pretraining today spans $10^{5}$--$10^{6}$ accelerators, making failures commonplace and elasticity no longer optional. We posit that an elastic-native training system must simultaneously ensure (i) Parameter Consistency, (ii) low Mean Time to Recovery (MTTR), (iii) high post-change Throughput, and (iv) Computation Consistency, an objective set that no prior work has jointly attained. To achieve these goals, we present ElasWave, which provides per-step fault tolerance via multi-dimensional scheduling across Graph, Dataflow, Frequency, and Random Number Generation. ElasWave resizes and reshards micro-batch workloads while preserving the global batch size and gradient scale; it performs online pipeline resharding with asynchronous parameter migration, interleaving ZeRO partitions so that recovery reduces to disjoint rank-to-rank transfers. It further uses DVFS to absorb pipeline bubbles and reshards RNG to keep computation consistent. A dynamic communicator enables in-place communication-group edits, while per-step in-memory snapshots support online verification and redistribution. We evaluated ElasWave on 96 NPUs and benchmarked it against state-of-the-art baselines: throughput improves by $1.35\times$ over ReCycle and $1.60\times$ over TorchFT; communicator recovery completes within one second (up to $82\times$/$3.6\times$ faster than full/partial rebuilds); migration MTTR drops by as much as $51\%$; and convergence deviation is reduced by approximately $78\%$.
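One way to read "reshards RNG to keep computation consistent" is that per-sample randomness (dropout masks, augmentations) must not depend on which physical rank happens to process a sample after a rescale. A minimal sketch of that idea, assuming seeds are derived from logical identities (training step, global sample index) rather than the physical rank; the function and seeding scheme below are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of rank-layout-independent RNG: derive each sample's
# seed from logical identifiers only, so whichever worker processes the
# sample after an elastic rescale reproduces the same random draws.

import hashlib

def sample_seed(step: int, global_sample_idx: int, base_seed: int = 1234) -> int:
    """Deterministic 64-bit seed from (step, sample) identity, not rank."""
    payload = f"{base_seed}:{step}:{global_sample_idx}".encode()
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")

# The same sample at the same step yields the same seed on any rank:
assert sample_seed(7, 40961) == sample_seed(7, 40961)
```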