🤖 AI Summary
Large-scale ML training is frequently interrupted by hardware/software failures or operational events; existing checkpointing and reconfiguration approaches suffer from prolonged downtime, performance degradation, or policy drift. This paper proposes TrainMover, a highly reliable training runtime featuring two novel mechanisms: (i) two-stage incremental communication group reconstruction and (ii) communication-free sandboxed shadow iterations. Together, these enable second-level fault migration and continuous training with zero memory overhead. Key technical contributions include delta-based communication group management, a lightweight state synchronization protocol, coordinated standby node scheduling, and sandboxed shadow iterations. Experiments demonstrate that (i) all evaluated models migrate with second-level downtime; (ii) training efficiency reaches 99% under periodic load balancing every 10 minutes; and (iii) the system robustly handles diverse real-world interruption scenarios (node failures, network partitions, and maintenance-induced evictions) without accuracy loss or manual intervention.
📝 Abstract
Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.
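The core idea behind delta-based communication group setup can be illustrated with a small sketch: after a membership change, compute the delta between the old and new rank assignments and rebuild only the communication groups that actually contain a replaced rank, leaving all other groups (and their underlying communicator state) untouched. The sketch below is purely illustrative; all names and the group representation are assumptions, not TrainMover's actual implementation.

```python
# Hypothetical sketch of delta-based communication group management.
# Rather than tearing down and recreating every communicator after a
# failure, we swap failed ranks for standby ranks and flag only the
# affected groups for reconstruction.

def groups_to_rebuild(groups, failed_ranks, standby_ranks):
    """Return (updated_groups, rebuilt_ids): group membership with failed
    ranks replaced by standby ranks, plus the ids of groups that changed."""
    assert len(failed_ranks) == len(standby_ranks)
    replacement = dict(zip(failed_ranks, standby_ranks))
    updated, rebuilt = {}, []
    for gid, members in groups.items():
        if any(r in replacement for r in members):
            updated[gid] = [replacement.get(r, r) for r in members]
            rebuilt.append(gid)        # only these need reconstruction
        else:
            updated[gid] = members     # untouched groups keep their state
    return updated, rebuilt

# Example: three groups; rank 3 fails and standby rank 8 takes over.
groups = {"dp0": [0, 1, 2, 3], "dp1": [4, 5, 6, 7], "tp0": [0, 4]}
updated, rebuilt = groups_to_rebuild(groups, [3], [8])
# Only "dp0" must be rebuilt; "dp1" and "tp0" are left intact.
```

Because reconstruction cost scales with the delta rather than the full topology, a single node replacement touches only the few groups spanning that node, which is what makes second-level migration plausible.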