SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

📅 2026-02-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses frequent node failures in ultra-large-scale (100,000+ GPU) large language model pretraining, where restart overhead can dominate training time and existing fault-tolerance mechanisms fall short. The authors propose SPARe, a framework that combines stacked parallelism with adaptive gradient-synchronization reordering, introducing redundant data shards across parallel groups to mask the impact of failures. SPARe achieves availability comparable to traditional replication at a near-constant 2–3× computational overhead. Through closed-form theoretical analysis, SimGrid-based discrete-event simulation, and joint optimization of redundancy and checkpointing, the authors show that SPARe reduces time-to-train by 40–50% relative to conventional replication at scales up to 600,000 GPUs.
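The core idea of masking failures via redundant shards can be illustrated with a toy model. The sketch below is an assumption, not the paper's actual placement scheme: it assigns each data shard to `redundancy` distinct parallel groups round-robin, so training can continue as long as every shard survives on at least one healthy group. All names (`stack_shards`, `covered`) are hypothetical.

```python
# Toy model of stacked redundant shard placement (NOT the paper's scheme):
# each shard is mapped to `redundancy` distinct parallel groups, so any
# failure pattern that leaves one replica of every shard alive is masked.

def stack_shards(num_shards: int, num_groups: int, redundancy: int) -> dict:
    """Assign each shard to `redundancy` distinct groups, round-robin."""
    assert redundancy <= num_groups
    return {s: [(s + k) % num_groups for k in range(redundancy)]
            for s in range(num_shards)}

def covered(placement: dict, failed_groups: set) -> bool:
    """True if every shard still lives on at least one healthy group."""
    return all(any(g not in failed_groups for g in groups)
               for groups in placement.values())
```

With redundancy 2 over 4 groups, any single group failure is masked, while two adjacent failures can wipe out both replicas of some shard; raising redundancy to 3 tolerates any two failures in this toy layout.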

📝 Abstract
In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. Existing fault-tolerance mechanisms, however, are largely unprepared for this restart-dominant regime. To address this challenge, we propose SPARe (Stacked Parallelism with Adaptive Reordering), a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. SPARe achieves availability comparable to traditional replication while keeping computation overhead near-constant at only 2–3×, even at high redundancy levels where traditional replication incurs overhead that grows linearly. We derive closed-form expressions for the endurable failure count and the computation overhead, validate them via SimGrid-based discrete-event simulation, and jointly optimize redundancy and checkpointing to minimize time-to-train. At extreme scale, up to 600k GPUs, SPARe reduces time-to-train by 40–50% compared to traditional replication.
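Why checkpoint interval matters in the restart-dominant regime can be sketched with a standard first-order model. This is NOT the paper's closed-form analysis: it uses the classic Young/Daly approximation for the optimal checkpoint interval, with hypothetical parameter values (per-GPU MTBF, checkpoint and restart costs) chosen only for illustration.

```python
import math

# Illustrative Young/Daly-style model (assumption, not the paper's formulas):
# at 100k+ GPUs the system-level MTBF shrinks to minutes, and the expected
# wall-clock inflation from checkpointing + lost work + restarts grows sharply.

def optimal_checkpoint_interval(mtbf_system: float, ckpt_cost: float) -> float:
    """Young/Daly first-order optimum: tau* = sqrt(2 * MTBF * C)."""
    return math.sqrt(2.0 * mtbf_system * ckpt_cost)

def expected_slowdown(mtbf_system: float, ckpt_cost: float,
                      restart_cost: float) -> float:
    """Approximate wall-clock inflation over failure-free training time."""
    tau = optimal_checkpoint_interval(mtbf_system, ckpt_cost)
    # time fraction lost to: checkpoint writes, restart cost per failure,
    # and rework of (on average) half an interval per failure
    waste = ckpt_cost / tau + (restart_cost + tau / 2.0) / mtbf_system
    return 1.0 / (1.0 - waste)

# Hypothetical numbers: per-GPU MTBF of 5 years -> system MTBF ~26 min
# at 100k GPUs; 60 s checkpoints, 300 s restarts.
system_mtbf = 5 * 365 * 24 * 3600 / 100_000
```

Under these assumed parameters the model already predicts a wall-clock inflation near 2×, which is the restart-dominant regime the abstract describes; a mechanism that masks failures without a full restart attacks the dominant `restart_cost` term directly.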
Problem

Research questions and friction points this paper is trying to address.

fault tolerance
large-scale LLM pretraining
GPU failures
restart overhead
extreme-scale training
Innovation

Methods, ideas, or system contributions that make the work stand out.

fault tolerance
large-scale LLM training
stacked parallelism
adaptive reordering
redundancy optimization