ResiHP: Taming LLM Training Failures with Dynamic Hybrid

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
Large-scale LLM training is vulnerable to hardware failures, which cause performance skew across devices, and to sequence-length variability, which makes iteration times fluctuate and can trigger spurious fail-slow detections; existing resilient training systems handle neither well. This work proposes ResiHP, whose failure detector is built on a lightweight, workload-aware execution time predictor that separates real failures from normal iteration time fluctuations. ResiHP also introduces a dynamic hybrid-parallel scheduler that adapts parallelism group sizes, model partitioning, and workload scheduling policies. On a 256-GPU cluster, ResiHP improves training throughput by 1.04× to 4.39× over state-of-the-art resilient training systems under diverse failure scenarios.
📝 Abstract
Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices lead to performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations in hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To respond, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector to accurately identify failures. In particular, it employs a workload-aware execution time predictor that disentangles failures from iteration time fluctuations while remaining lightweight for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures. Experiments show that ResiHP improves training throughput by 1.04-4.39× compared with state-of-the-art resilient training systems under diverse failure scenarios in a 256-GPU cluster.
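The idea of a workload-aware detector can be illustrated with a minimal sketch. This is not ResiHP's actual predictor, which the abstract does not specify. It assumes a hypothetical cost model in which iteration time grows linearly with the total token count and quadratically with individual sequence lengths (a rough proxy for attention cost). The names `workload_features`, `IterTimePredictor`, `is_fail_slow`, and the `tolerance` value are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


def workload_features(seq_lens: Sequence[int]) -> Tuple[float, float]:
    """Return (total tokens, sum of squared sequence lengths) for one iteration."""
    tokens = float(sum(seq_lens))
    squared = float(sum(n * n for n in seq_lens))
    return tokens, squared


@dataclass
class IterTimePredictor:
    """Hypothetical model: time ~= a * tokens + b * sum(len^2). An assumption, not the paper's."""
    a: float = 0.0  # seconds per token
    b: float = 0.0  # seconds per unit of squared sequence length

    def fit(self, batches: Sequence[Sequence[int]], times: Sequence[float]) -> None:
        """Least-squares fit (no intercept) on iterations known to be healthy."""
        s11 = s12 = s22 = r1 = r2 = 0.0
        for seq_lens, t in zip(batches, times):
            x1, x2 = workload_features(seq_lens)
            s11 += x1 * x1
            s12 += x1 * x2
            s22 += x2 * x2
            r1 += x1 * t
            r2 += x2 * t
        # Solve the 2x2 normal equations in closed form.
        det = s11 * s22 - s12 * s12
        if abs(det) <= 1e-12 * s11 * s22:
            raise ValueError("features are collinear; need more varied batches")
        self.a = (r1 * s22 - r2 * s12) / det
        self.b = (s11 * r2 - s12 * r1) / det

    def predict(self, seq_lens: Sequence[int]) -> float:
        x1, x2 = workload_features(seq_lens)
        return self.a * x1 + self.b * x2


def is_fail_slow(predictor: IterTimePredictor, seq_lens: Sequence[int],
                 observed_seconds: float, tolerance: float = 1.3) -> bool:
    """Flag an iteration only if it is slow relative to its own predicted workload.

    `tolerance` is an arbitrary illustrative threshold, not a value from the paper.
    """
    return observed_seconds > tolerance * predictor.predict(seq_lens)
```

With a predictor fitted on healthy iterations, a batch of long sequences that legitimately takes several seconds is compared against its own larger prediction and is not flagged, while a batch of short sequences that runs several times slower than predicted is. A fixed threshold on raw iteration time would tend to get both cases wrong, which is the kind of spurious fail-slow detection the abstract describes.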
Problem

Research questions and friction points this paper is trying to address.

hybrid parallelism
LLM training
hardware failures
sequence length variability
performance skew
Innovation

Methods, ideas, or system contributions that make the work stand out.

resilient training
hybrid parallelism
failure detection
dynamic scheduling
large language models