🤖 AI Summary
Large-scale ML training is frequently interrupted by hardware/software failures or operational events; existing checkpointing and reconfiguration approaches suffer from prolonged downtime, performance degradation, or policy drift. This paper proposes TrainMover, a highly reliable training runtime featuring two novel mechanisms: (i) two-stage incremental communication group reconstruction and (ii) communication-free sandboxed shadow iterations. Together, these enable second-level fault migration and continuous training with zero memory overhead. Key technical contributions include delta-based communication group management, a lightweight state synchronization protocol, coordinated standby node scheduling, and sandboxed shadow iterations. Experiments demonstrate that (i) all evaluated models migrate with second-level downtime; (ii) training efficiency reaches 99% under periodic load balancing every 10 minutes; and (iii) the system robustly handles diverse real-world interruption scenarios (node failures, network partitions, and maintenance-induced evictions) without accuracy loss or manual intervention.
📝 Abstract
Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.
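The core idea behind delta-based communication group setup can be illustrated with a small sketch: after a membership change, compute the delta between the old and new rank assignments and rebuild only the communication groups that actually contain a replaced rank, leaving all other groups (and their underlying communicator state) untouched. The sketch below is purely illustrative; all names and the group representation are assumptions, not TrainMover's actual implementation.

```python
# Hypothetical sketch of delta-based communication group management.
# Rather than tearing down and recreating every communicator after a
# failure, we swap failed ranks for standby ranks and flag only the
# affected groups for reconstruction.

def groups_to_rebuild(groups, failed_ranks, standby_ranks):
    """Return (updated_groups, rebuilt_ids): group membership with failed
    ranks replaced by standby ranks, plus the ids of groups that changed."""
    assert len(failed_ranks) == len(standby_ranks)
    replacement = dict(zip(failed_ranks, standby_ranks))
    updated, rebuilt = {}, []
    for gid, members in groups.items():
        if any(r in replacement for r in members):
            updated[gid] = [replacement.get(r, r) for r in members]
            rebuilt.append(gid)        # only these need reconstruction
        else:
            updated[gid] = members     # untouched groups keep their state
    return updated, rebuilt

# Example: three groups; rank 3 fails and standby rank 8 takes over.
groups = {"dp0": [0, 1, 2, 3], "dp1": [4, 5, 6, 7], "tp0": [0, 4]}
updated, rebuilt = groups_to_rebuild(groups, [3], [8])
# Only "dp0" must be rebuilt; "dp1" and "tp0" are left intact.
```

Because reconstruction cost scales with the delta rather than the full topology, a single node replacement touches only the few groups spanning that node, which is what makes second-level migration plausible.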