🤖 AI Summary
Reinforcement learning (RL) post-training is critical for enhancing the diverse capabilities of large language models (LLMs), yet existing synchronous systems suffer from low resource utilization and poor scalability. To address these limitations, the paper proposes ROLL Flash, a fully asynchronous RL post-training architecture grounded in two core design principles: fine-grained parallelism and decoupling of rollout generation from policy training. The architecture integrates key techniques—including asynchronous execution, queue-based scheduling, environment-level parallelism, and off-policy algorithm support—to maximize hardware efficiency and flexibility. It enables scalable training for both reinforcement learning with verifiable rewards (RLVR) and agentic tasks. Experiments demonstrate that, under identical GPU resources, ROLL Flash achieves up to 2.24× and 2.72× speedups on RLVR and agentic tasks, respectively, while matching the convergence performance of synchronous baselines.
📝 Abstract
Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, most systems run it synchronously and still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24× speedup on RLVR tasks and 2.72× on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can match the performance of synchronous training.
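The core idea of rollout-train decoupling with queue scheduling can be illustrated with a minimal sketch. This is not the ROLL Flash implementation (the paper does not publish this code in the abstract); it is a hedged toy model in which hypothetical rollout workers push completed trajectories into a bounded queue and a trainer consumes them asynchronously, so a slow generation never stalls policy updates:

```python
import queue
import threading

def rollout_worker(worker_id, prompts, traj_queue):
    """Toy generation worker: each prompt yields one trajectory.
    Stands in for LLM rollout plus reward scoring (hypothetical)."""
    for prompt in prompts:
        trajectory = {"worker": worker_id, "prompt": prompt,
                      "reward": len(prompt) % 3}
        traj_queue.put(trajectory)  # blocks if the queue is full (backpressure)

def trainer(traj_queue, num_expected, batch_size=4):
    """Consume trajectories as they arrive and group them into
    training batches, a stand-in for off-policy updates."""
    batches, batch = [], []
    for _ in range(num_expected):
        batch.append(traj_queue.get())  # proceeds as soon as any worker finishes
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches

def run_async(prompt_shards):
    """Launch one rollout worker per prompt shard; train concurrently."""
    traj_queue = queue.Queue(maxsize=16)  # bounded: throttles fast workers
    total = sum(len(s) for s in prompt_shards)
    workers = [threading.Thread(target=rollout_worker,
                                args=(i, shard, traj_queue))
               for i, shard in enumerate(prompt_shards)]
    for w in workers:
        w.start()
    batches = trainer(traj_queue, total)
    for w in workers:
        w.join()
    return batches

if __name__ == "__main__":
    batches = run_async([["a", "bb"], ["ccc", "dddd"], ["eeeee"]])
    print(len(batches))  # 5 trajectories with batch_size=4 -> 2 batches
```

The bounded queue gives the decoupling its shape: rollout and training advance independently, and the queue depth bounds how stale (off-policy) consumed trajectories can be relative to the current policy.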