AI Summary
This work proposes a dataflow-oriented reinforcement learning system that addresses the limitations of existing large language model RL frameworks, which struggle to efficiently support multi-policy co-training and elastic, heterogeneous, cross-region execution, and which require extensive custom development to scale. By decoupling rollout generation, dataflow management, and training into autonomous components, the system abandons the conventional trainer-centric architecture. This design natively enables multi-policy collaboration, elastic scaling, cross-region heterogeneous execution, and composable data processing pipelines, all without modifying the system code. Experiments on mathematics, code generation, search, and AgentBench tasks show that the proposed system trains up to 2.7× faster while attaining comparable or superior accuracy.
Abstract
Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training by 2.7×.