RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training

📅 2025-12-12
🤖 AI Summary
In reinforcement learning (RL) post-training, decoupling rollout and training improves hardware specialization but introduces severe inter-cluster idle time (“bubbles”) due to on-policy synchronization. Method: We propose RollMux, a cross-cluster cooperative scheduling framework featuring a novel *co-execution group* abstraction and a two-layer scheduler enabling phase-level multiplexing. It enforces *residency constraints* that keep model states persistently resident in host memory, enabling low-overhead “warm-start” context switching. RollMux integrates conservative stochastic planning, provably optimal round-robin scheduling, locality-domain isolation, and GPU-cluster-wide coordinated orchestration. Results: Evaluated on a production-scale platform with 328×H20 and 328×H800 GPUs, RollMux achieves 1.84× higher cost efficiency than standard decoupled execution and 1.38× over the state-of-the-art co-location baseline, while achieving 100% SLO attainment.

📝 Abstract
Rollout-training disaggregation is emerging as the standard architecture for Reinforcement Learning (RL) post-training, where memory-bound rollout and compute-bound training are physically disaggregated onto purpose-built clusters to maximize hardware efficiency. However, the strict synchronization required by on-policy algorithms introduces severe dependency bubbles, forcing one cluster to idle while the dependent phase is running on the other. We present RollMux, a cluster scheduling framework that reclaims these bubbles through cross-cluster orchestration. RollMux is built on the insight that the structural idleness of one job can be effectively utilized by the active phase of another. To realize this, we introduce the co-execution group abstraction, which partitions the cluster into isolated locality domains. This abstraction enables a two-tier scheduling architecture: an inter-group scheduler that optimizes job placement using conservative stochastic planning, and an intra-group scheduler that orchestrates a provably optimal round-robin schedule. The group abstraction also imposes a residency constraint, ensuring that massive model states remain cached in host memory to enable "warm-start" context switching. We evaluate RollMux on a production-scale testbed with 328 H20 and 328 H800 GPUs. RollMux improves cost efficiency by 1.84x over standard disaggregation and 1.38x over state-of-the-art co-located baselines, all while achieving 100% SLO attainment.
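The core idea, reclaiming one job's dependency bubble with another job's active phase, can be illustrated with a toy round-robin simulation. This is a minimal sketch under simplifying assumptions (unit-length phases, one job per cluster per step, jobs offset by one step); it is not the authors' scheduler, and all names are illustrative:

```python
def utilization(jobs, phases):
    """Toy simulation of phase-level multiplexing in a co-execution group.

    Each job alternates a rollout phase (on the rollout cluster) and a
    training phase (on the training cluster); on-policy synchronization
    means a job's training must wait for its own rollout. Offsetting job
    k's start by k steps lets another job's active phase fill each
    cluster's bubble. Returns the busy fraction of each cluster.
    """
    rollout_busy = training_busy = 0
    timeline_len = 2 * phases + len(jobs) - 1  # last job finishes here
    for t in range(timeline_len):
        r_active = t_active = False
        for k, _ in enumerate(jobs):
            local = t - k  # job k's local phase clock (starts k steps late)
            if 0 <= local < 2 * phases:
                if local % 2 == 0:
                    r_active = True  # rollout phase occupies rollout cluster
                else:
                    t_active = True  # training phase occupies training cluster
        rollout_busy += r_active
        training_busy += t_active
    return rollout_busy / timeline_len, training_busy / timeline_len
```

With a single job, each cluster idles half the time (the dependency bubble of standard disaggregation); with two offset jobs, both clusters stay busy except for the short ramp-up/ramp-down tail.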
Problem

Research questions and friction points this paper is trying to address.

Reduces idle time in disaggregated RL clusters
Enables cross-cluster job scheduling for efficiency
Improves hardware utilization via phase-level multiplexing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-cluster orchestration reclaims dependency bubbles
Two-tier scheduling with inter-group and intra-group optimization
Co-execution groups enable warm-start context switching
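The residency constraint behind warm-start switching can be sketched as follows. This is an illustrative model only: the class name, API, and timing constants are assumptions, not the paper's implementation; the point is that keeping every group member's model state cached in host memory turns a context switch into a cheap host-to-GPU copy rather than a remote checkpoint fetch:

```python
# Illustrative costs (assumed, not measured in the paper).
HOST_TO_GPU_S = 2      # warm start: copy weights host memory -> GPU
REMOTE_FETCH_S = 120   # cold start: refetch checkpoint from remote storage

class CoExecutionGroup:
    """Toy model of a co-execution group with a residency constraint."""

    def __init__(self, jobs):
        # Residency constraint: model states of all member jobs stay
        # cached in host memory for the group's lifetime.
        self.host_cache = set(jobs)
        self.on_gpu = None  # job whose weights currently occupy the GPUs

    def switch_to(self, job):
        """Return the context-switch cost (seconds) of activating `job`."""
        if self.on_gpu == job:
            return 0  # already active, no switch needed
        cost = HOST_TO_GPU_S if job in self.host_cache else REMOTE_FETCH_S
        self.on_gpu = job
        return cost
```

Under these assumed numbers, switching between group members costs seconds (warm start), while activating a non-resident job would pay the two-orders-of-magnitude larger cold-start fetch, which is why the scheduler only multiplexes jobs within a group.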
👥 Authors
Tianyuan Wu (CSE Department, HKUST)
Lunxi Cao (Hong Kong University of Science and Technology)
Yining Wei (UIUC)
Wei Gao (Hong Kong University of Science and Technology)
Yuheng Zhao (Fudan University)
Dakai An (Alibaba Group)
Shaopan Xiong (Alibaba Group)
Zhiqiang Lv (Didi Chuxing Technology Company)
Ju Huang (Alibaba Group)
Siran Yang (Alibaba Group)
Yinghao Yu (Alibaba)
Jiamang Wang (Alibaba Group)
Lin Qu (Alibaba Group)
Wei Wang (Hong Kong University of Science and Technology)