🤖 AI Summary
Reinforcement learning (RL) post-training is critical for enhancing the diverse capabilities of large language models (LLMs), yet existing synchronous systems suffer from low resource utilization and poor scalability. To address these limitations, the paper proposes ROLL Flash, a fully asynchronous RL post-training architecture grounded in two core design principles: fine-grained parallelism and decoupling of rollout generation from policy training. The architecture integrates key techniques—including asynchronous execution, queue-based scheduling, environment-level parallelism, and off-policy algorithm support—to maximize hardware efficiency and flexibility. It enables scalable training for both reinforcement learning with verifiable rewards (RLVR) and agentic tasks. Experiments demonstrate that, under identical GPU resources, ROLL Flash achieves up to 2.24× and 2.72× speedups on RLVR and agentic tasks, respectively, while matching the convergence performance of synchronous baselines.
📝 Abstract
Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, most systems run it synchronously and still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24× speedup on RLVR tasks and 2.72× on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can match the performance of synchronous training.
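The core idea of rollout-train decoupling with queue scheduling can be illustrated with a minimal sketch. This is not the ROLL Flash implementation (the paper does not publish this code in the abstract); it is a hedged toy model in which hypothetical rollout workers push completed trajectories into a bounded queue and a trainer consumes them asynchronously, so a slow generation never stalls policy updates:

```python
import queue
import threading

def rollout_worker(worker_id, prompts, traj_queue):
    """Toy generation worker: each prompt yields one trajectory.
    Stands in for LLM rollout plus reward scoring (hypothetical)."""
    for prompt in prompts:
        trajectory = {"worker": worker_id, "prompt": prompt,
                      "reward": len(prompt) % 3}
        traj_queue.put(trajectory)  # blocks if the queue is full (backpressure)

def trainer(traj_queue, num_expected, batch_size=4):
    """Consume trajectories as they arrive and group them into
    training batches, a stand-in for off-policy updates."""
    batches, batch = [], []
    for _ in range(num_expected):
        batch.append(traj_queue.get())  # proceeds as soon as any worker finishes
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches

def run_async(prompt_shards):
    """Launch one rollout worker per prompt shard; train concurrently."""
    traj_queue = queue.Queue(maxsize=16)  # bounded: throttles fast workers
    total = sum(len(s) for s in prompt_shards)
    workers = [threading.Thread(target=rollout_worker,
                                args=(i, shard, traj_queue))
               for i, shard in enumerate(prompt_shards)]
    for w in workers:
        w.start()
    batches = trainer(traj_queue, total)
    for w in workers:
        w.join()
    return batches

if __name__ == "__main__":
    batches = run_async([["a", "bb"], ["ccc", "dddd"], ["eeeee"]])
    print(len(batches))  # 5 trajectories with batch_size=4 -> 2 batches
```

The bounded queue gives the decoupling its shape: rollout and training advance independently, and the queue depth bounds how stale (off-policy) consumed trajectories can be relative to the current policy.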