LiveR: Fine-Grained Elasticity via Live Reconfiguration for Model Training

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high reconfiguration latency and training downtime that existing elastic training systems incur when large model training runs on volatile GPU resources such as spot instances. These systems typically rely on checkpoint-based stop-and-restart. The authors propose LiveR, a runtime system that reconfigures hybrid-parallel (tensor, pipeline, and data parallel) training live, without stop-and-restart. While the current configuration keeps training, LiveR asynchronously prepares the target configuration, bootstraps newly added workers in isolation so that heavyweight initialization stays off the critical path, and streams model state directly over high-bandwidth interconnects, reshaping it online. A lightweight commit then switches training to the new configuration. Implemented atop Megatron-LM and PyTorch, LiveR reduces reconfiguration downtime from minutes to seconds, accelerates reconfiguration by 14–23× over checkpoint/restart baselines, incurs minimal steady-state overhead, and sustains up to 99% training goodput under volatile-resource conditions.

📝 Abstract

To reduce user costs and maximize cluster utilization, large model training increasingly leverages volatile but inexpensive GPU capacity, such as spot instances and reclaimable resources in shared clusters. Yet, capitalizing on these economic benefits requires jobs to adapt within the short warning windows that many such environments provide. Existing elastic training systems still treat reconfiguration as stop-and-restart: they externalize distributed state through checkpoints, rebuild the distributed runtime on a new topology, and restart training, turning each resize event into a storage-heavy recovery procedure that incurs substantial downtime from checkpoint I/O, process restart, CUDA initialization, and communicator setup. We present LiveR, a live reconfiguration runtime for elastic LLM training that replaces storage-backed restart with a live, bounded-memory handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world, bootstraps newly added workers in isolation to keep heavyweight initialization off the critical path, and streams model state directly over high-bandwidth interconnects while reshaping it online across tensor, pipeline, and data parallel dimensions. Once the target world is ready, LiveR performs a lightweight commit that switches training to the new configuration without stop-and-restart on the live path. We implement LiveR atop Megatron-LM and PyTorch and evaluate it end-to-end on a multi-node GPU cluster. Across diverse reconfiguration scenarios, LiveR reduces downtime from minutes to seconds, accelerates reconfiguration by 14×–23× over checkpoint/restart baselines, incurs minimal steady-state overhead, and sustains up to 99% training goodput under volatile-resource conditions, making volatile low-cost GPU capacity far more practical for LLM training.
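
To make "reshaping it online" concrete, the sketch below computes which row ranges must move between ranks when a weight matrix that is evenly split along one dimension changes its tensor-parallel degree. It is an illustration only, not LiveR's actual algorithm: the names `Transfer`, `shard_bounds`, `reshard_plan`, and `apply_plan` are hypothetical, plain Python lists stand in for GPU tensors, and the even-split layout is an assumption. A real system would also have to handle pipeline and data parallel dimensions, optimizer state, and the actual network transfers.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One contiguous copy from an old shard to a new shard (hypothetical type)."""
    src_rank: int    # rank in the current tensor-parallel group
    dst_rank: int    # rank in the target tensor-parallel group
    src_offset: int  # first row to read, relative to the source shard
    dst_offset: int  # first row to write, relative to the target shard
    length: int      # number of contiguous rows to copy


def shard_bounds(dim: int, world: int, rank: int) -> tuple[int, int]:
    """Half-open global row range [start, end) owned by `rank` under an even split."""
    if dim % world != 0:
        raise ValueError("this sketch assumes dim is divisible by the group size")
    size = dim // world
    return rank * size, (rank + 1) * size


def reshard_plan(dim: int, old_tp: int, new_tp: int) -> list[Transfer]:
    """List every overlap between an old shard and a new shard as a Transfer."""
    plan = []
    for dst in range(new_tp):
        d_start, d_end = shard_bounds(dim, new_tp, dst)
        for src in range(old_tp):
            s_start, s_end = shard_bounds(dim, old_tp, src)
            lo, hi = max(d_start, s_start), min(d_end, s_end)
            if lo < hi:  # the two ranges intersect, so rows must be copied
                plan.append(Transfer(src, dst, lo - s_start, lo - d_start, hi - lo))
    return plan


def apply_plan(old_shards: list[list], plan: list[Transfer],
               dim: int, new_tp: int) -> list[list]:
    """Execute the plan on in-memory lists (a stand-in for direct rank-to-rank sends)."""
    size = dim // new_tp
    new_shards = [[None] * size for _ in range(new_tp)]
    for t in plan:
        chunk = old_shards[t.src_rank][t.src_offset:t.src_offset + t.length]
        new_shards[t.dst_rank][t.dst_offset:t.dst_offset + t.length] = chunk
    return new_shards


if __name__ == "__main__":
    # 12 rows split over 4 ranks, reshaped to 3 ranks.
    old = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
    plan = reshard_plan(dim=12, old_tp=4, new_tp=3)
    print(apply_plan(old, plan, dim=12, new_tp=3))
```

Because each transfer is a bounded contiguous slice, the plan can in principle be executed chunk by chunk, without first materializing the full tensor on any one rank or writing it to storage, which is the property the abstract attributes to a bounded-memory handoff.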

Problem

Research questions and friction points this paper is trying to address.

elastic training

live reconfiguration

volatile GPU resources

checkpoint/restart

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

live reconfiguration

elastic training

LLM