🤖 AI Summary
Distributed training of large language models (LLMs) frequently suffers from hardware/software failures, and existing fault-tolerance techniques—such as redundant computation, dynamic parallelism, and data rerouting—impose persistent overhead, incur long recovery latency, or degrade throughput.
Method: The paper proposes an adaptive fault-tolerance framework built on a unified performance model and a fast execution-plan search, enabling real-time selection of the optimal recovery strategy when a failure occurs. The framework integrates redundant computation, dynamic parallelism, and data rerouting, augmented with communication optimizations and fine-grained performance estimation, to avoid the inherent overheads of prior approaches.
Contribution/Results: Evaluated on a 32-GPU cluster, the framework limits post-failure performance degradation to within 11.00% of failure-free training, achieves up to 1.229× and 1.355× higher average throughput than Oobleck and Recycle, respectively, and preserves model convergence and memory efficiency.
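The selection step described above can be illustrated with a minimal sketch: a unified performance model estimates the total cost of each candidate recovery plan (one-off reconfiguration latency plus steady-state overhead over the remaining iterations), and the cheapest plan wins. The strategy names, cost terms, and numbers below are illustrative assumptions for exposition, not Odyssey's actual model or values.

```python
# Hypothetical sketch of cost-model-driven recovery-strategy selection.
# Strategy names and cost parameters are illustrative assumptions,
# not Odyssey's actual implementation.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str                 # e.g. "redundant_computation"
    steady_overhead: float    # per-iteration slowdown after recovery (fraction)
    reconfig_seconds: float   # one-time reconfiguration latency

def estimated_cost(plan: Plan, iter_seconds: float, remaining_iters: int) -> float:
    """Unified cost model: one-off reconfiguration latency plus
    per-iteration overhead over the rest of the training run."""
    return plan.reconfig_seconds + remaining_iters * iter_seconds * (1 + plan.steady_overhead)

def select_plan(plans, iter_seconds, remaining_iters):
    """Pick the plan with the lowest estimated total cost."""
    return min(plans, key=lambda p: estimated_cost(p, iter_seconds, remaining_iters))

plans = [
    Plan("redundant_computation", steady_overhead=0.15, reconfig_seconds=5),
    Plan("dynamic_parallelism",   steady_overhead=0.02, reconfig_seconds=600),
    Plan("data_rerouting",        steady_overhead=0.08, reconfig_seconds=60),
]

# A long remaining run amortizes a slow reconfiguration, so a plan with
# low steady-state overhead wins despite its higher one-off cost.
best = select_plan(plans, iter_seconds=2.0, remaining_iters=100_000)
# → best.name == "dynamic_parallelism"
```

The same mechanism flips the choice near the end of training, when a cheap reconfiguration matters more than steady-state overhead; this trade-off is exactly why a single fixed strategy leaves performance on the table.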
📝 Abstract
Training large language models faces frequent interruptions from various faults, demanding robust fault tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfiguration, or post-recovery inefficiencies. We propose Odyssey, an adaptive fault-tolerant system that intelligently selects the optimal recovery strategy when a failure occurs. Odyssey achieves this through a unified performance model, expedient execution-plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-GPU cluster show that Odyssey keeps the performance gap between post-recovery and failure-free training within 11.00%, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Odyssey achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.