🤖 AI Summary
To address the high synchronization overhead and low resource utilization of Agentic Reinforcement Learning (Agentic RL) training on heterogeneous hardware—characterized by GPU-based prefilling, bandwidth-constrained decoding, and CPU-intensive environment simulation—this paper proposes a high-throughput distributed training system designed for disaggregated infrastructure. The method introduces three core innovations: trajectory-level fine-grained asynchronous execution, hardware-aware task mapping, and state-aware serverless computation offloading. Leveraging heterogeneous GPU scheduling, trajectory-level pipeline parallelism, serverless reward-model deployment, and cross-layer optimization spanning CPU, GPU, and NVLink, the system achieves a 1.35–2.05× end-to-end training speedup over monolithic synchronous baselines. The authors scale training of a hundreds-of-billions-parameter Mixture-of-Experts (MoE) model on an Alibaba cluster exceeding 3,000 GPUs, demonstrating robustness and strong scalability at massive scale.
📝 Abstract
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. Unlike standard LLM post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive prefill phases, bandwidth-bound decoding, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires disaggregated infrastructure to leverage specialized, best-fit hardware. However, naive disaggregation introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages.
We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. RollArc is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to best-fit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArc effectively improves training throughput and achieves a 1.35–2.05× end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArc by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArc's scalability and robustness. The code is available at https://github.com/alibaba/ROLL.
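To make the second principle concrete, here is a minimal sketch of trajectory-level asynchrony: rather than synchronizing entire batches across stages, each trajectory flows independently through queued stages (standing in for prefill, decode, and environment simulation), so one slow trajectory does not stall the rest. All function and variable names below are illustrative assumptions, not RollArc's actual API.

```python
import queue
import threading

def run_pipeline(trajectories, stages):
    """Chain stages with queues; each worker pulls trajectories as they arrive,
    so downstream stages start before upstream ones finish the whole batch."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    results = []

    def worker(stage_fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:          # sentinel: propagate shutdown downstream
                q_out.put(None)
                break
            q_out.put(stage_fn(item))  # process one trajectory at a time

    threads = [
        threading.Thread(target=worker, args=(fn, qs[i], qs[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for traj in trajectories:
        qs[0].put(traj)               # feed trajectories individually, not as a batch
    qs[0].put(None)
    while True:
        out = qs[-1].get()
        if out is None:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

# Toy stand-ins for prefill, decode, and environment-step stages.
out = run_pipeline(
    [1, 2, 3],
    [lambda x: x * 10, lambda x: x + 1, lambda x: x * 2],
)
```

Because each stage has its own queue, the "bubble" where fast stages idle waiting for a batch barrier is avoided; the same decoupling is what lets a disaggregated system map each stage to its best-fit hardware.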