ResiHP: Taming LLM Training Failures with Dynamic Hybrid

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large-scale LLM training is exposed both to hardware failures and to iteration-time fluctuations caused by variable sequence lengths. Failures create performance skew across devices, while the fluctuations can be mistaken for fail-slow events; existing resilient systems handle both poorly. This work proposes ResiHP, which pairs a workload-aware failure Detector, built on a lightweight execution time predictor, with a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies. On a 256-GPU cluster, ResiHP improves training throughput by 1.04× to 4.39× over state-of-the-art resilient training systems under diverse failure scenarios.

📝 Abstract

Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices lead to performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations in hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To respond, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector to accurately identify failures. In particular, it employs a workload-aware execution time predictor that disentangles failures from iteration time fluctuations while remaining lightweight for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures. Experiments show that ResiHP improves training throughput by 1.04× to 4.39× compared with state-of-the-art resilient training systems under diverse failure scenarios in a 256-GPU cluster.
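
The abstract does not describe the Detector's internals, so the following is only an illustrative sketch of the general idea of workload-aware fail-slow detection, not ResiHP's actual method. It assumes a hypothetical cost model in which iteration time grows linearly with total tokens and quadratically with per-sequence length (a rough stand-in for attention cost), fitted by least squares on iterations known to be healthy. All function names, the cost model, and the `slack` threshold are assumptions made for illustration.

```python
def cost_features(seq_lens):
    """Hypothetical workload features, with sequence lengths scaled to thousands of tokens."""
    tokens_k = [n / 1000.0 for n in seq_lens]
    linear = sum(tokens_k)                      # proxy for per-token (e.g. MLP) work
    quadratic = sum(t * t for t in tokens_k)    # proxy for attention work
    return linear, quadratic


def fit_predictor(healthy_history):
    """Fit time ~= a * linear + b * quadratic by least squares (normal equations).

    healthy_history: list of (seq_lens, seconds) pairs from failure-free iterations.
    """
    s11 = s12 = s22 = t1 = t2 = 0.0
    for seq_lens, secs in healthy_history:
        x1, x2 = cost_features(seq_lens)
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        t1 += x1 * secs
        t2 += x2 * secs
    det = s11 * s22 - s12 * s12
    if det == 0.0:
        raise ValueError("history does not vary enough to fit two coefficients")
    a = (t1 * s22 - t2 * s12) / det
    b = (s11 * t2 - s12 * t1) / det
    return a, b


def predict_time(coeffs, seq_lens):
    """Expected iteration time for this batch under the fitted cost model."""
    a, b = coeffs
    x1, x2 = cost_features(seq_lens)
    return a * x1 + b * x2


def is_fail_slow(coeffs, seq_lens, observed_secs, slack=1.3):
    """Flag a slowdown only when the observed time exceeds the workload-adjusted prediction."""
    return observed_secs > slack * predict_time(coeffs, seq_lens)
```

Under this sketch, a long-sequence batch that takes many times longer than a typical iteration is not flagged as long as it stays near its own predicted time, whereas a short batch that runs well above its prediction is. A detector that compares every iteration against one fixed baseline would misjudge the first case, which is the kind of spurious fail-slow detection the abstract describes.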

Problem

Research questions and friction points this paper is trying to address.

hybrid parallelism

LLM training

hardware failures

sequence length variability

performance skew

Innovation

Methods, ideas, or system contributions that make the work stand out.

resilient training

hybrid parallelism

failure detection