RollArc: Scaling Agentic RL Training via Disaggregated Infrastructure

📅 2025-12-27
🤖 AI Summary
To address the high synchronization overhead and low resource utilization of Agentic Reinforcement Learning (Agentic RL) training on heterogeneous hardware—characterized by GPU-based prefilling, bandwidth-constrained decoding, and CPU-intensive environment simulation—this paper proposes the first high-throughput distributed training system designed for disaggregated infrastructure. The method introduces three core innovations: trajectory-level fine-grained asynchronous execution, hardware-aware task mapping, and state-aware serverless computation offloading. Leveraging heterogeneous GPU scheduling, trajectory-level pipeline parallelism, serverless reward-model deployment, and cross-layer optimization across CPU, GPU, and NVLink, the system achieves a 1.35–2.05× end-to-end training speedup over monolithic synchronous baselines. The authors scale training of a hundreds-of-billions-parameter Mixture-of-Experts (MoE) model on an Alibaba cluster exceeding 3,000 GPUs, demonstrating robustness and strong scalability at massive scale.

📝 Abstract
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. Unlike standard LLM post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive prefill phases, bandwidth-bound decoding, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires disaggregated infrastructure to leverage specialized, best-fit hardware. However, naive disaggregation introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages. We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. RollArc is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to best-fit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArc effectively improves training throughput and achieves a 1.35–2.05× end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArc by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArc's scalability and robustness. The code is available at https://github.com/alibaba/ROLL.
Problem

Research questions and friction points this paper is trying to address.

Optimizing heterogeneous agentic RL workloads on disaggregated hardware
Minimizing synchronization overhead in distributed RL training systems
Improving throughput for large-scale multi-task RL model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregated infrastructure leverages specialized hardware for efficiency
Hardware-affinity mapping routes tasks to best-fit GPU devices
Fine-grained asynchrony manages execution at trajectory level
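The trajectory-level asynchrony idea above can be illustrated with a minimal sketch. This is not RollArc's actual API—all names here (`rollout_worker`, `trainer`, the queue hand-off) are hypothetical—but it shows the core principle: each rollout worker hands off a trajectory the moment it finishes, and the trainer consumes trajectories one at a time, so there is no batch-level synchronization barrier where a slow environment stalls everyone else.

```python
import queue
import threading

def rollout_worker(worker_id, num_trajectories, out_queue):
    """Illustrative rollout loop: emit each trajectory as soon as it completes."""
    for t in range(num_trajectories):
        trajectory = {"worker": worker_id, "step": t, "reward": float(t)}
        out_queue.put(trajectory)  # hand off immediately, no batch barrier

def trainer(in_queue, total_expected):
    """Illustrative trainer: consume trajectories individually as they arrive."""
    consumed = []
    for _ in range(total_expected):
        consumed.append(in_queue.get())  # train on one trajectory at a time
    return consumed

traj_queue = queue.Queue()
workers = [threading.Thread(target=rollout_worker, args=(i, 4, traj_queue))
           for i in range(3)]
for w in workers:
    w.start()
result = trainer(traj_queue, total_expected=12)
for w in workers:
    w.join()
print(len(result))
```

In a synchronous design, the trainer would wait for all 12 trajectories from all workers before stepping; here each trajectory flows to the consumer independently, which is the "resource bubble" mitigation the bullet describes, stated under the simplifying assumptions above.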
👥 Authors
Wei Gao, HKUST
Yuheng Zhao, Fudan University (Data Visualization, Visual Analytics, Human-AI Collaboration)
Tianyuan Wu, CSE Department, HKUST (ML Systems, Reinforcement Learning)
Shaopan Xiong, Alibaba Group
Weixun Wang, Alibaba Group
Dakai An, HKUST
Lunxi Cao, HKUST
Dilxat Muhtar, Nanjing University (Computer Vision, Deep Learning, Natural Language Processing)
Zichen Liu, Alibaba Group
Haizhou Zhao, Alibaba Group
Ju Huang, Alibaba Group
Siran Yang, Alibaba Group
Yongbin Li, Tongyi Lab, Alibaba
Wenbo Su, Alibaba Group
Jiamang Wang, Alibaba Group
Lin Qu, Alibaba Group
Bo Zheng, Alibaba Group
Wei Wang, HKUST