Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) post-training is critical for enhancing the diverse capabilities of large language models (LLMs), yet existing synchronous systems suffer from low resource utilization and poor scalability. To address these limitations, we propose a fully asynchronous RL post-training architecture grounded in two core design principles: fine-grained parallelism and decoupling of rollout generation from policy training. Our architecture integrates key techniques—including asynchronous execution, queue-based scheduling, environment-level parallelism, and off-policy algorithm support—to maximize hardware efficiency and flexibility. It enables scalable training for both reinforcement learning with verifiable rewards (RLVR) and agentic tasks. Experiments demonstrate that, under identical GPU resources, our approach achieves 2.24× and 2.72× speedups for RLVR and agentic tasks, respectively, while matching the convergence performance of synchronous baselines.

📝 Abstract
Synchronous Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, many systems designed to accelerate RL post-training still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24x speedup on RLVR tasks and 2.72x on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can achieve performance on par with synchronous training.
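The environment-level asynchronous execution described above can be illustrated with a minimal, hypothetical sketch: each environment rolls out independently, so one slow episode (e.g., a long LLM generation or tool call) does not stall the whole batch. The names (`run_episode`, `rollout_all`) are illustrative and do not reflect the actual ROLL Flash API.

```python
import asyncio

async def run_episode(env_id, num_steps):
    """Roll out one environment; the await stands in for real I/O
    (model generation, tool calls) with non-uniform latency."""
    total_reward = 0.0
    for _ in range(num_steps):
        await asyncio.sleep(0)  # yield control instead of blocking peers
        total_reward += 1.0     # stand-in for an environment reward
    return env_id, total_reward

async def rollout_all(num_envs, num_steps):
    # Launch all episodes concurrently rather than stepping in lockstep.
    tasks = [run_episode(i, num_steps) for i in range(num_envs)]
    return await asyncio.gather(*tasks)  # preserves task order

results = asyncio.run(rollout_all(num_envs=4, num_steps=3))
print(results)  # [(0, 3.0), (1, 3.0), (2, 3.0), (3, 3.0)]
```

In a synchronous design, the batch would advance at the pace of its slowest environment; here stragglers only delay their own episode.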
Problem

Research questions and friction points this paper is trying to address.

Improving low resource utilization in RL post-training systems
Addressing limited scalability of synchronous RL training methods
Accelerating RLVR and agentic training through asynchronous execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous RL post-training with fine-grained parallelism
Rollout-train decoupling with queue scheduling mechanism
Environment-level asynchronous execution for improved scalability
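The rollout-train decoupling with queue scheduling listed above can be sketched as a producer-consumer pipeline: rollout workers push trajectories into a bounded queue while the trainer consumes fixed-size batches, possibly generated by a slightly stale policy (hence the need for off-policy algorithm support). This is a minimal sketch under assumed names (`rollout_worker`, `trainer`), not the ROLL Flash implementation.

```python
import queue
import threading

def rollout_worker(worker_id, num_rollouts, traj_queue):
    """Producer: generate trajectories independently and enqueue them."""
    for step in range(num_rollouts):
        # Stand-in for LLM generation against an environment.
        trajectory = {"worker": worker_id, "step": step, "reward": 1.0}
        traj_queue.put(trajectory)

def trainer(traj_queue, batch_size, num_batches):
    """Consumer: pull batches as they become available. Batches may mix
    data from an older policy snapshot, i.e., off-policy training."""
    updates = 0
    for _ in range(num_batches):
        batch = [traj_queue.get() for _ in range(batch_size)]
        # Stand-in for a policy-gradient update on the batch.
        updates += 1
    return updates

traj_queue = queue.Queue(maxsize=64)  # bounded queue provides backpressure
workers = [
    threading.Thread(target=rollout_worker, args=(i, 8, traj_queue))
    for i in range(4)
]
for w in workers:
    w.start()

# 4 workers x 8 rollouts = 32 trajectories -> 8 batches of 4.
num_updates = trainer(traj_queue, batch_size=4, num_batches=8)
for w in workers:
    w.join()
print(num_updates)  # 8
```

Because the trainer never waits for a full synchronous generation round, GPUs assigned to training stay busy while rollout workers keep producing.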
👥 Authors
Han Lu (Alibaba Group)
Zichen Liu (Alibaba Group)
Shaopan Xiong (Alibaba Group)
Yancheng He (Alibaba Group)
Wei Gao (Hong Kong University of Science and Technology)
Yanan Wu (Alibaba Group)
Weixun Wang (Alibaba Group)
Jiashun Liu (Alibaba Group)
Yang Li (Alibaba Group)
Haizhou Zhao (Alibaba Group)
Ju Huang (Alibaba Group)
Siran Yang (Alibaba Group)
Xiaoyang Li (Southern University of Science and Technology)
Yijia Luo (Alibaba Group)
Zihe Liu (Alibaba Group)
Ling Pan (Hong Kong University of Science and Technology)
Junchi Yan (Shanghai Jiao Tong University)
Wei Wang (Hong Kong University of Science and Technology)
Wenbo Su (Alibaba Group)
Jiamang Wang (Alibaba Group)
Lin Qu (Alibaba Group)
Bo Zheng (Alibaba Group)