🤖 AI Summary
In reinforcement learning (RL) post-training, decoupling rollout and training improves hardware specialization but introduces severe inter-cluster idle time (“bubbles”) due to on-policy synchronization.
Method: We propose RollMux, a cross-cluster cooperative scheduling framework built on a novel *co-execution group* abstraction and a two-layer scheduler that enables phase-level multiplexing. It enforces *residency constraints* that keep model states persistently resident in host memory, enabling low-overhead "warm-start" context switching. RollMux integrates conservative stochastic planning, provably optimal round-robin scheduling, locality-domain isolation, and cluster-wide coordinated orchestration.
Results: Evaluated on a production-scale platform with 328 H20 and 328 H800 GPUs, RollMux achieves 1.84× higher cost efficiency than standard decoupled execution and 1.38× higher than the state-of-the-art co-location baseline, while achieving 100% SLO attainment.
📝 Abstract
Rollout-training disaggregation is emerging as the standard architecture for Reinforcement Learning (RL) post-training, where memory-bound rollout and compute-bound training are physically disaggregated onto purpose-built clusters to maximize hardware efficiency. However, the strict synchronization required by on-policy algorithms introduces severe dependency bubbles, forcing one cluster to idle while the dependent phase is running on the other. We present RollMux, a cluster scheduling framework that reclaims these bubbles through cross-cluster orchestration. RollMux is built on the insight that the structural idleness of one job can be effectively utilized by the active phase of another. To realize this, we introduce the co-execution group abstraction, which partitions the cluster into isolated locality domains. This abstraction enables a two-tier scheduling architecture: an inter-group scheduler that optimizes job placement using conservative stochastic planning, and an intra-group scheduler that orchestrates a provably optimal round-robin schedule. The group abstraction also imposes a residency constraint, ensuring that massive model states remain cached in host memory to enable "warm-start" context switching. We evaluate RollMux on a production-scale testbed with 328 H20 and 328 H800 GPUs. RollMux improves cost efficiency by 1.84x over standard disaggregation and 1.38x over state-of-the-art co-located baselines, all while achieving 100% SLO attainment.
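The bubble-reclamation idea can be illustrated with a toy timeline simulation. The sketch below is not the paper's implementation: `Job`, `schedule_round_robin`, and the integer time units are all hypothetical. It models two jobs whose rollout and training phases run on two dedicated clusters under an on-policy dependency (each job's training must follow its own rollout), and interleaves the jobs in round-robin order so that while one job trains, the other's rollout occupies the otherwise-idle rollout cluster.

```python
# Illustrative sketch of phase-level multiplexing across a rollout cluster
# and a training cluster. All names and numbers here are hypothetical,
# chosen only to show how round-robin interleaving reclaims dependency
# bubbles; this is not RollMux's actual scheduler.

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    rollout_time: int  # duration of one rollout phase (arbitrary units)
    train_time: int    # duration of one training phase


def schedule_round_robin(jobs, iterations):
    """Interleave jobs' phases on two clusters in round-robin order.

    Each job alternates rollout -> training (on-policy synchronization),
    but the two clusters are shared across jobs, so one job's training
    can overlap another job's rollout. Returns (makespan, rollout-cluster
    busy time, training-cluster busy time).
    """
    rollout_free = 0  # time at which the rollout cluster next becomes idle
    train_free = 0    # time at which the training cluster next becomes idle
    ready = {j.name: 0 for j in jobs}  # earliest start of each job's next iter
    rollout_busy = train_busy = 0

    for _ in range(iterations):
        for j in jobs:
            # Rollout phase: wait for the rollout cluster and for this
            # job's previous training phase to finish.
            start_r = max(rollout_free, ready[j.name])
            end_r = start_r + j.rollout_time
            rollout_free = end_r
            rollout_busy += j.rollout_time

            # Training phase: strictly after this job's own rollout.
            start_t = max(train_free, end_r)
            end_t = start_t + j.train_time
            train_free = end_t
            train_busy += j.train_time
            ready[j.name] = end_t

    return max(rollout_free, train_free), rollout_busy, train_busy
```

With two symmetric jobs (rollout = train = 3 units, 2 iterations each), running them back-to-back on the cluster pair would take 2 × 2 × 6 = 24 units with each cluster idle half the time; the round-robin interleaving finishes in 15 units because each cluster's bubbles are filled by the other job's phases.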