AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

📅 2025-05-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current reinforcement learning (RL) systems for large language models (LLMs) are predominantly synchronous and batched: the rollout phase must wait for the longest sample to complete, which leaves GPUs underutilized and slows training. This work introduces AReaL, a fully asynchronous RL system for language reasoning. It decouples rollout from training so that both run continuously in parallel, balances the workload of rollout and training workers to control data staleness, adopts a staleness-enhanced PPO variant to handle outdated samples while keeping training stable, and adds a collection of system-level optimizations. On math and code reasoning benchmarks, the system achieves up to 2.57× training speedup over the best synchronous systems with the same number of GPUs, with matched or improved final performance.

📝 Abstract
Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous by alternating generation and training in a batch setting, where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency. Generation must wait until the longest output in the batch is completed before model update, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.57× training speedup compared to the best synchronous systems with the same number of GPUs and matched or even improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
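
The inefficiency of synchronous generation can be illustrated with a small calculation. The sketch below is not taken from AReaL; it assumes a simplified cost model in which every sequence occupies one decode slot, every generated token costs one step, and a finished slot sits idle until the longest output in the batch completes. The function name `sync_utilization` and the example lengths are hypothetical.

```python
def sync_utilization(lengths):
    """Fraction of slot-steps doing useful work in one synchronous batch.

    Assumed cost model (illustrative only): one decode slot per sequence,
    one step per generated token, and finished slots stay idle until the
    longest sequence in the batch is done.
    """
    longest = max(lengths)
    useful_steps = sum(lengths)            # steps that produced a token
    total_steps = longest * len(lengths)   # steps every slot was reserved for
    return useful_steps / total_steps


# Four outputs with a long tail: one long sample dominates the batch time.
lengths = [100, 200, 400, 1600]
print(sync_utilization(lengths))  # 2300 / 6400 = 0.359375
```

Under these assumptions, a single long output leaves about 64% of the reserved slot-steps idle. An asynchronous design targets exactly this gap, because rollout workers can start new generations instead of waiting for the batch to finish.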

Problem

Research questions and friction points this paper is trying to address.

How to build an efficient asynchronous RL system for large language models
How to decouple generation from training to improve GPU utilization
How to keep RL training stable when samples come from outdated model versions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully asynchronous RL system decouples generation and training
Balances rollout and training workloads to control data staleness
Uses staleness-enhanced PPO variant for outdated samples
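
To make "data staleness" concrete, the sketch below tags each rollout with the version of the policy that generated it and only hands the trainer samples whose version lag is within a bound. This is a simplified stand-in, not AReaL's mechanism: the abstract says AReaL controls staleness by balancing rollout and training workloads, whereas this example simply discards samples that are too old. The names `Rollout`, `StalenessBoundedBuffer`, and `max_staleness` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    policy_version: int  # version of the weights that generated this sample


class StalenessBoundedBuffer:
    """Illustrative buffer that drops rollouts older than max_staleness versions."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness
        self.items = deque()

    def put(self, rollout):
        self.items.append(rollout)

    def take_batch(self, batch_size, current_version):
        """Return batch_size sufficiently fresh rollouts, or None if too few remain."""
        # Discard samples whose version lag exceeds the bound.
        self.items = deque(
            r for r in self.items
            if current_version - r.policy_version <= self.max_staleness
        )
        if len(self.items) < batch_size:
            return None
        return [self.items.popleft() for _ in range(batch_size)]


buf = StalenessBoundedBuffer(max_staleness=2)
for version in range(4):
    buf.put(Rollout(text=f"sample-{version}", policy_version=version))

# The trainer is at version 3, so the version-0 sample (lag 3) is dropped.
batch = buf.take_batch(batch_size=2, current_version=3)
print([r.policy_version for r in batch])  # [1, 2]
```

Discarding wastes generation work, which is one reason a real system might prefer to limit how far rollout workers can run ahead of the trainer instead. The admitted samples are still off-policy by up to `max_staleness` versions, which is the situation a staleness-oriented PPO variant is meant to handle.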