🤖 AI Summary
Distributed training of large language models (LLMs) frequently suffers from hardware/software failures, and existing fault-tolerance techniques—such as redundant computation, dynamic parallelism, and data rerouting—impose persistent overhead, incur long recovery latency, or degrade throughput.
Method: The paper proposes an adaptive fault-tolerance framework built on a unified performance model and a fast execution-plan search, enabling real-time selection of the optimal recovery strategy when a failure occurs. The framework integrates redundant computation, dynamic parallelism, and data rerouting, augmented with communication optimizations and fine-grained performance estimation, to avoid the inherent overheads of prior approaches.
Contribution/Results: Evaluated on a 32-GPU cluster, the framework limits post-failure performance degradation to within 11.00% of failure-free training, achieves up to 1.229× and 1.355× higher average throughput than Oobleck and Recycle, respectively, and preserves model convergence and memory efficiency.
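The selection step described above can be illustrated with a minimal sketch: a unified performance model estimates the total cost of each candidate recovery plan (one-off reconfiguration latency plus steady-state overhead over the remaining iterations), and the cheapest plan wins. The strategy names, cost terms, and numbers below are illustrative assumptions for exposition, not Odyssey's actual model or values.

```python
# Hypothetical sketch of cost-model-driven recovery-strategy selection.
# Strategy names and cost parameters are illustrative assumptions,
# not Odyssey's actual implementation.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str                 # e.g. "redundant_computation"
    steady_overhead: float    # per-iteration slowdown after recovery (fraction)
    reconfig_seconds: float   # one-time reconfiguration latency

def estimated_cost(plan: Plan, iter_seconds: float, remaining_iters: int) -> float:
    """Unified cost model: one-off reconfiguration latency plus
    per-iteration overhead over the rest of the training run."""
    return plan.reconfig_seconds + remaining_iters * iter_seconds * (1 + plan.steady_overhead)

def select_plan(plans, iter_seconds, remaining_iters):
    """Pick the plan with the lowest estimated total cost."""
    return min(plans, key=lambda p: estimated_cost(p, iter_seconds, remaining_iters))

plans = [
    Plan("redundant_computation", steady_overhead=0.15, reconfig_seconds=5),
    Plan("dynamic_parallelism",   steady_overhead=0.02, reconfig_seconds=600),
    Plan("data_rerouting",        steady_overhead=0.08, reconfig_seconds=60),
]

# A long remaining run amortizes a slow reconfiguration, so a plan with
# low steady-state overhead wins despite its higher one-off cost.
best = select_plan(plans, iter_seconds=2.0, remaining_iters=100_000)
# → best.name == "dynamic_parallelism"
```

The same mechanism flips the choice near the end of training, when a cheap reconfiguration matters more than steady-state overhead; this trade-off is exactly why a single fixed strategy leaves performance on the table.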
📝 Abstract
Training large language models faces frequent interruptions from various faults, demanding robust fault tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfiguration, or post-recovery inefficiencies. We propose Odyssey, an adaptive fault-tolerant system that intelligently selects the optimal recovery strategy when a failure occurs. Odyssey achieves this through a unified performance model, expedient execution-plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-GPU cluster show that Odyssey keeps the performance gap between post-recovery and failure-free training within 11.00%, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Odyssey achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.