🤖 AI Summary
To address the high synchronization overhead and low resource utilization of Agentic Reinforcement Learning (Agentic RL) training on heterogeneous hardware—characterized by GPU-based prefilling, bandwidth-constrained decoding, and CPU-intensive environment simulation—this paper proposes a high-throughput distributed training system designed for disaggregated infrastructure. The method introduces three core innovations: trajectory-level fine-grained asynchronous execution, hardware-aware task mapping, and state-aware serverless computation offloading. Leveraging heterogeneous GPU scheduling, trajectory-level pipeline parallelism, serverless reward-model deployment, and cross-layer optimization spanning CPU, GPU, and NVLink, the system achieves a 1.35–2.05× end-to-end training speedup over monolithic synchronous baselines. The authors scale training of a hundreds-of-billions-parameter Mixture-of-Experts (MoE) model on an Alibaba cluster exceeding 3,000 GPUs, demonstrating robustness and strong scalability at massive scale.
📝 Abstract
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. Unlike standard LLM post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive prefill phases, bandwidth-bound decoding, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires disaggregated infrastructure to leverage specialized, best-fit hardware. However, naive disaggregation introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages.
We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. RollArc is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to best-fit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArc effectively improves training throughput and achieves a 1.35–2.05× end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArc by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArc's scalability and robustness. The code is available at https://github.com/alibaba/ROLL.
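To make the second principle concrete, here is a minimal sketch of trajectory-level asynchrony: rather than synchronizing entire batches across stages, each trajectory flows independently through queued stages (standing in for prefill, decode, and environment simulation), so one slow trajectory does not stall the rest. All function and variable names below are illustrative assumptions, not RollArc's actual API.

```python
import queue
import threading

def run_pipeline(trajectories, stages):
    """Chain stages with queues; each worker pulls trajectories as they arrive,
    so downstream stages start before upstream ones finish the whole batch."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    results = []

    def worker(stage_fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:          # sentinel: propagate shutdown downstream
                q_out.put(None)
                break
            q_out.put(stage_fn(item))  # process one trajectory at a time

    threads = [
        threading.Thread(target=worker, args=(fn, qs[i], qs[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for traj in trajectories:
        qs[0].put(traj)               # feed trajectories individually, not as a batch
    qs[0].put(None)
    while True:
        out = qs[-1].get()
        if out is None:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

# Toy stand-ins for prefill, decode, and environment-step stages.
out = run_pipeline(
    [1, 2, 3],
    [lambda x: x * 10, lambda x: x + 1, lambda x: x * 2],
)
```

Because each stage has its own queue, the "bubble" where fast stages idle waiting for a batch barrier is avoided; the same decoupling is what lets a disaggregated system map each stage to its best-fit hardware.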