🤖 AI Summary
This work addresses the low utilization of external resources such as CPUs and GPUs in agentic reinforcement learning within cloud clusters, a problem exacerbated by static resource reservation and task isolation. To overcome this limitation, the authors propose a novel action-level resource orchestration mechanism that introduces a unified modeling and scheduling paradigm at the action granularity. A dedicated manager tailored to heterogeneous resource topologies is designed to enable fine-grained resource sharing and elastic scheduling, minimizing action completion time while respecting heterogeneity constraints. Experimental results demonstrate that the proposed approach reduces action completion time by up to 4.3×, accelerates RL training steps by 1.5×, and decreases external resource consumption by 71.2%. The method has been successfully deployed in the training of the MiMo series of models.
📝 Abstract
Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve complex problems through interactions with the real world. However, unlike traditional RL, agentic RL demands substantial external cloud resources, e.g., CPUs for code execution and GPUs for reward models, that exist outside the primary training cluster. Existing agentic RL frameworks typically rely on static over-provisioning, i.e., resources are often tied to long-lived trajectories or isolated by tasks, which leads to severe resource inefficiency.
We propose action-level orchestration and incorporate it into ARL-Tangram, a unified resource management system that enables fine-grained external resource sharing and elasticity. ARL-Tangram utilizes a unified action-level formulation and an elastic scheduling algorithm to minimize action completion time (ACT) while satisfying heterogeneous resource constraints. Further, heterogeneous resource managers are tailored to efficiently support action-level execution on resources with diverse characteristics and topologies. Evaluation on real-world agentic RL tasks demonstrates that ARL-Tangram improves average ACT by up to 4.3$\times$, speeds up the step duration of RL training by up to 1.5$\times$, and reduces external resource consumption by up to 71.2$\%$. This system has been deployed to support the training of the MiMo series of models.
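To make the contrast with trajectory-level reservation concrete, here is a minimal toy sketch of action-level dispatch: instead of pinning a worker to each trajectory for its whole lifetime, every action (e.g., a CPU code-execution call or a GPU reward-model call) is routed to whichever shared worker of the required type frees up first. All names (`Action`, `schedule`, the pool sizes) are illustrative assumptions, not the paper's actual algorithm or API; the real system additionally handles topology and elasticity.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    resource: str    # "cpu" (code execution) or "gpu" (reward model)
    duration: float  # seconds of external work (assumed known here)

def schedule(actions, pools):
    """Greedy earliest-availability dispatch at action granularity.

    `pools` maps a resource type to its worker count. Each action goes to
    the worker of the required type that becomes free soonest, so workers
    are shared across trajectories rather than statically reserved.
    Returns each action's completion time (a toy stand-in for ACT).
    """
    # One min-heap of "free-at" times per resource type.
    free_at = {rtype: [0.0] * count for rtype, count in pools.items()}
    for heap in free_at.values():
        heapq.heapify(heap)

    completion = {}
    for act in actions:
        heap = free_at[act.resource]
        start = heapq.heappop(heap)        # earliest-free shared worker
        finish = start + act.duration
        heapq.heappush(heap, finish)       # worker busy until `finish`
        completion[act.name] = finish
    return completion

actions = [
    Action("exec_1", "cpu", 2.0),
    Action("exec_2", "cpu", 2.0),
    Action("exec_3", "cpu", 2.0),
    Action("reward_1", "gpu", 1.0),
]
acts = schedule(actions, pools={"cpu": 2, "gpu": 1})
```

With two shared CPU workers, `exec_3` waits only until a worker frees at t=2 and finishes at t=4; under static per-trajectory reservation with the same hardware, each trajectory would need its own idle-most-of-the-time worker to match that latency.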