TrainMover: An Interruption-Resilient and Reliable ML Training Runtime

📅 2024-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale ML training is frequently interrupted by hardware/software failures or operational events; existing checkpointing and reconfiguration approaches suffer from prolonged downtime, performance degradation, or undesired changes to training strategies. This paper proposes a highly reliable training runtime featuring two novel mechanisms: (i) two-phase, incremental (delta-based) communication group reconstruction and (ii) communication-free sandboxed shadow iterations. Together, these enable fast fault migration and continuous training with zero memory overhead. Key technical contributions include delta-based communication group management, a lightweight state synchronization protocol, coordinated standby node scheduling, and sandboxed shadow iterations. Experiments demonstrate that (i) all evaluated models migrate with second-level downtime; (ii) training efficiency reaches 99% under periodic load balancing every 10 minutes; and (iii) the system robustly handles diverse real-world interruption scenarios, including node failures, network partitions, and maintenance-induced evictions, without accuracy loss or manual intervention.
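The delta-based idea above can be sketched in a few lines. This is an illustrative simulation, not TrainMover's implementation: communication groups are modeled as plain lists of ranks, and when a rank is replaced by a standby, only the groups containing that rank are rebuilt while every other group (and its underlying communicator) is reused unchanged.

```python
# Hypothetical sketch of delta-based communication group management:
# swap a standby rank into only the groups that contained the failed rank.

def rebuild_groups(groups, failed_rank, standby_rank):
    """Return (new_groups, num_rebuilt) after replacing failed_rank."""
    new_groups, rebuilt = [], 0
    for g in groups:
        if failed_rank in g:
            # Delta path: only this group needs a fresh communicator.
            new_groups.append([standby_rank if r == failed_rank else r
                               for r in g])
            rebuilt += 1
        else:
            # Untouched groups keep their existing communicators.
            new_groups.append(g)
    return new_groups, rebuilt

# Example: 2-way data-parallel x 4-way tensor-parallel layout over ranks 0-7.
dp_groups = [[0, 2, 4, 6], [1, 3, 5, 7]]
tp_groups = [[0, 1], [2, 3], [4, 5], [6, 7]]
groups = dp_groups + tp_groups

new_groups, rebuilt = rebuild_groups(groups, failed_rank=5, standby_rank=8)
print(rebuilt)  # 2 -- only the two groups containing rank 5 are rebuilt
```

Rebuilding two communicators instead of six is what keeps the reconstruction cost proportional to the failure, not to the job size.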

📝 Abstract
Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.
Problem

Research questions and friction points this paper is trying to address.

Minimize downtime when large-scale ML training jobs are interrupted
Maintain high training efficiency through hardware and software failures
Avoid the performance degradation and unintended training-strategy changes incurred by existing checkpointing and reconfiguration approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages standby machines for minimal downtime
Uses two-phase, delta-based communication group setups
Implements communication-free sandboxed shadow iterations
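The third bullet can be illustrated with a toy sketch (hypothetical names throughout, not TrainMover's API): a standby machine runs the same training-step code path with its collectives stubbed out to no-ops, so the step executes end to end without generating any network traffic or touching the live job's communicators.

```python
# Illustrative sketch of a communication-free sandboxed shadow iteration.

class SandboxComm:
    """Stub communicator: all_reduce returns local values unchanged,
    so the shadow iteration produces no network traffic."""
    def all_reduce(self, grads):
        return grads

def train_step(params, grads, comm, lr=0.5):
    """One SGD step; the same code runs live or inside the sandbox."""
    reduced = comm.all_reduce(grads)
    return [p - lr * g for p, g in zip(params, reduced)]

# Shadow iteration on the standby: identical code path, stubbed collectives.
params = [1.0, 2.0]
warmed = train_step(params, [0.5, 1.0], SandboxComm())
print(warmed)  # [0.75, 1.5]
```

The point of the sandbox is that the standby exercises the full iteration (allocator warm-up, kernel compilation, optimizer state layout) before joining, so the swap-in itself stays fast.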