🤖 AI Summary
This work addresses the challenge of efficiently supporting frequent parallelism reconfigurations in large language model training, which arise from resource fluctuations and cluster elasticity but are poorly handled by existing static frameworks. The authors propose DynaTrain, a system built on the Virtual Parameter Space (VPS) abstraction, which maps multi-dimensional parallelism configurations to deterministic geometric structures. State routing and migration mechanisms on top of VPS enable sub-second strategy switching. Combined with memory-aware, deadlock-free scheduling and the overlapping of topology changes with ongoing training, the system reconfigures a 70B dense model in under 2 seconds and a 235B MoE model in 4.36 seconds. This is up to three orders of magnitude faster than prior checkpoint-based and elastic approaches, while preserving training correctness.
📝 Abstract
Modern large language model (LLM) training is inherently dynamic: resource fluctuations, RLHF phase shifts, and cluster elasticity continually reshape the optimal parallelism layout, posing a significant challenge to existing training frameworks built around a static execution model. We present DynaTrain, a distributed training system for sub-second, online reconfiguration across arbitrary multi-dimensional parallelism. At its core, we propose a Virtual Parameter Space (VPS) abstraction that unifies all distributed training states under one logical coordinate space, turning any parallelism configuration into a deterministic mapping and collapsing complex transitions into manageable geometric intersections. On top of VPS, a state routing-and-transition layer executes rank-local transfers under a memory-aware, deadlock-free schedule, and an Elastic Device Manager overlaps new-world construction with ongoing training to mask topology-change cost. On dense and MoE models up to 235B parameters, DynaTrain reconfigures a 70B dense model in under 2s and a 235B MoE model in 4.36s, outperforming state-of-the-art checkpoint-based and elastic systems by up to three orders of magnitude while preserving correctness.
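To make the "geometric intersections" idea concrete, here is a minimal one-dimensional sketch. It is not DynaTrain's implementation: it assumes a flat logical parameter space split into contiguous, near-equal shards, and the names `shard_intervals`, `plan_transfers`, and `Transfer` are hypothetical. Each layout maps ranks to half-open intervals, and the overlap between an old-layout interval and a new-layout interval is exactly the slice one rank must hand to another.

```python
from typing import List, NamedTuple, Tuple


class Transfer(NamedTuple):
    """One slice of the logical parameter space moving between ranks (hypothetical type)."""
    src_rank: int
    dst_rank: int
    start: int  # inclusive offset in the logical space
    stop: int   # exclusive offset in the logical space


def shard_intervals(num_elements: int, num_ranks: int) -> List[Tuple[int, int]]:
    """Split [0, num_elements) into contiguous, near-equal half-open intervals."""
    base, extra = divmod(num_elements, num_ranks)
    intervals = []
    start = 0
    for rank in range(num_ranks):
        # The first `extra` ranks take one additional element each.
        size = base + (1 if rank < extra else 0)
        intervals.append((start, start + size))
        start += size
    return intervals


def plan_transfers(num_elements: int, old_ranks: int, new_ranks: int) -> List[Transfer]:
    """Intersect the old and new layouts to list every slice that must be routed."""
    old = shard_intervals(num_elements, old_ranks)
    new = shard_intervals(num_elements, new_ranks)
    plan = []
    for src, (a0, a1) in enumerate(old):
        for dst, (b0, b1) in enumerate(new):
            lo, hi = max(a0, b0), min(a1, b1)
            if lo < hi:  # non-empty overlap: src currently holds data that dst will own
                plan.append(Transfer(src, dst, lo, hi))
    return plan


if __name__ == "__main__":
    # Resharding 12 logical elements from 4 ranks down to 3 ranks.
    for t in plan_transfers(12, old_ranks=4, new_ranks=3):
        kind = "local" if t.src_rank == t.dst_rank else "send"
        print(f"{kind}: rank {t.src_rank} -> rank {t.dst_rank}, elements [{t.start}, {t.stop})")
```

Because both layouts are pure functions of the configuration, every rank can compute the same plan independently without exchanging metadata. A real multi-dimensional layout (tensor, pipeline, data, and expert parallelism) would intersect hyper-rectangles rather than intervals and would still need the memory-aware, deadlock-free ordering the abstract describes.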