Odyssey: Adaptive Policy Selection for Resilient Distributed Training

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Distributed training of large language models (LLMs) is frequently interrupted by hardware and software failures, and existing fault-tolerance techniques (redundant computation, dynamic parallelism, and data rerouting) impose persistent overhead, incur long recovery latency, or degrade throughput. Method: This paper proposes Odyssey, an adaptive fault-tolerance framework that couples a unified performance model with a fast execution-plan search to select an optimal recovery strategy in real time; it integrates redundant computation, dynamic parallelism, and data rerouting, augmented with communication optimizations and fine-grained performance estimation to avoid the inherent overheads of prior approaches. Contribution/Results: Evaluated on a 32-GPU cluster, Odyssey limits post-failure performance degradation to at most 11.00% relative to failure-free training, achieves up to 1.229× and 1.355× higher average throughput than Oobleck and Recycle, respectively, and preserves model convergence and memory efficiency.
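
To make the selection loop described above concrete, here is a minimal Python sketch: on failure, candidate recovery plans are scored by a performance model and the best one is applied. Everything here is hypothetical; the `RecoveryPlan` fields, the cost numbers in `estimate`, and the selection objective are illustrative stand-ins, not Odyssey's actual model or API.

```python
from dataclasses import dataclass

# Hypothetical sketch of adaptive recovery-policy selection in the
# spirit of Odyssey: on failure, enumerate candidate recovery plans,
# score each with a performance model, and apply the best one.
# All names and numbers below are illustrative, not the paper's API.

@dataclass
class RecoveryPlan:
    strategy: str            # "redundant_computation", "dynamic_parallelism", or "data_rerouting"
    throughput: float        # estimated post-recovery samples/sec
    reconfig_seconds: float  # estimated time to switch to this plan

def estimate(strategy: str, healthy_gpus: int) -> RecoveryPlan:
    """Toy performance model: trade steady-state throughput against
    reconfiguration latency. A real model would account for pipeline
    stage balance, tensor-parallel width, and collective costs."""
    base = 100.0 * healthy_gpus / 32           # relative to a 32-GPU cluster
    if strategy == "redundant_computation":
        return RecoveryPlan(strategy, base * 0.90, 5.0)    # persistent overhead, fast switch
    if strategy == "dynamic_parallelism":
        return RecoveryPlan(strategy, base * 0.98, 120.0)  # near-full speed, slow reconfig
    return RecoveryPlan("data_rerouting", base * 0.93, 20.0)

def select_plan(healthy_gpus: int, horizon_seconds: float) -> RecoveryPlan:
    """Pick the plan maximizing useful work over the remaining horizon,
    i.e. throughput * (horizon - reconfiguration time)."""
    plans = [estimate(s, healthy_gpus)
             for s in ("redundant_computation", "dynamic_parallelism", "data_rerouting")]
    return max(plans, key=lambda p: p.throughput * (horizon_seconds - p.reconfig_seconds))

if __name__ == "__main__":
    best = select_plan(healthy_gpus=31, horizon_seconds=3600.0)
    print(f"selected {best.strategy}: est. {best.throughput:.1f} samples/s")
```

Note the design point this illustrates: because each strategy trades steady-state throughput against switch-over cost differently, the best choice depends on failure context, which is why a one-size-fits-all policy leaves performance on the table.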

📝 Abstract
Training large language models faces frequent interruptions due to various faults, demanding robust fault tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Odyssey, an adaptive fault-tolerant system that intelligently selects optimal recovery strategies when a failure occurs. Odyssey achieves this through a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Odyssey maintains a performance gap within 11.00% between post-recovery and failure-free training, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Odyssey achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.
Problem

Research questions and friction points this paper is trying to address.

Adaptively selecting optimal recovery strategies for distributed training
Minimizing performance penalties from frequent training interruptions
Maintaining high throughput and convergence despite system failures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive policy selection for optimal recovery strategies
Unified performance model with expedient plan search (see the sketch after this list)
Efficient communication optimizations ensuring minimal overhead
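
The sketch below makes the "unified performance model with expedient plan search" idea concrete: it estimates per-iteration time for candidate pipeline layouts from a simple compute term plus an alpha-beta communication term, then compares plans. The cost formulas, parameter values, and function names are illustrative assumptions, not the paper's published model.

```python
# Hypothetical sketch of a unified performance model for one training
# iteration under a candidate parallelism plan. The cost terms and the
# alpha-beta communication model are illustrative assumptions, not
# Odyssey's published formulas.

def stage_compute_time(layers: int, flops_per_layer: float, gpu_flops: float) -> float:
    """Compute time for one pipeline stage's layers on one GPU."""
    return layers * flops_per_layer / gpu_flops

def collective_time(bytes_moved: float, alpha: float, beta: float) -> float:
    """Alpha-beta model: a fixed latency term plus a bandwidth term."""
    return alpha + bytes_moved * beta

def iteration_time(stage_layers: list[int], micro_batches: int,
                   flops_per_layer: float, gpu_flops: float,
                   activation_bytes: float, alpha: float, beta: float) -> float:
    """1F1B-style pipeline estimate: the slowest stage bounds steady-state
    throughput; warm-up/drain adds roughly one pass over all stages."""
    per_stage = [stage_compute_time(l, flops_per_layer, gpu_flops)
                 + collective_time(activation_bytes, alpha, beta)
                 for l in stage_layers]
    bottleneck = max(per_stage)
    return micro_batches * bottleneck + sum(per_stage)

# Expedient plan search: compare a balanced 4-stage plan against a
# 3-stage plan that redistributes layers after losing a GPU.
plans = {"balanced_4_stage": [8, 8, 8, 8], "rebalanced_3_stage": [11, 11, 10]}
for name, layers in plans.items():
    t = iteration_time(layers, micro_batches=16,
                       flops_per_layer=1e12, gpu_flops=3e14,
                       activation_bytes=64e6, alpha=1e-5, beta=1e-11)
    print(f"{name}: {t * 1000:.1f} ms/iteration")
```

Because a single cost function scores heterogeneous recovery plans on the same scale, the search can rank redundant computation, reshaped pipelines, and rerouted data flows directly against each other instead of hand-tuning each in isolation.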
🔎 Similar Papers
No similar papers found.
Yuhang Zhou
State Key Laboratory for Novel Software Technology, Nanjing University, China
Zhibin Wang
Zhejiang University
New particle formation · Aerosols · Hygroscopicity · Black carbon
Peng Jiang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Haoran Xia
State Key Laboratory for Novel Software Technology, Nanjing University, China
Junhe Lu
State Key Laboratory for Novel Software Technology, Nanjing University, China
Qianyu Jiang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Rong Gu
Mälardalen University
Formal Methods · Machine Learning · Autonomous Systems
Hengxi Xu
Huawei, China
Xinjing Huang
Huawei, China
Guanghuan Fang
Huawei, China
Zhiheng Hu
Huawei, China
Jingyi Zhang
Huawei, China
Yongjin Cai
Huawei, China
Jian He
Huawei, China
Chen Tian
Professor, Nanjing University
Data Center Networking · Network Function Virtualisation · Content Distribution