AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language model (LLM) reinforcement learning (RL) systems predominantly adopt synchronous batched paradigms, where the rollout phase must wait for the longest sample to complete—resulting in low GPU utilization and suboptimal training efficiency. This work introduces the first fully asynchronous RL system tailored for language reasoning: it decouples rollout and training pipelines to enable continuous parallel execution; proposes a staleness-controllable dynamic load balancing mechanism; designs a staleness-aware PPO variant that enhances convergence robustness without compromising training stability; and integrates an efficient rollout caching scheme with optimized batch scheduling. Evaluated on mathematical and code reasoning benchmarks, the system achieves up to 2.57× training speedup while matching or exceeding the final performance of synchronous baselines under identical GPU resource constraints.
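The decoupling described above can be illustrated with a toy, single-process sketch (all names, the staleness cap, and the batch size here are hypothetical illustrations, not taken from the paper's codebase): rollout steps never block on training, and the trainer consumes a batch as soon as enough sufficiently fresh samples have accumulated.

```python
from collections import deque

MAX_STALENESS = 2   # hypothetical cap on policy-version lag (a tunable knob)
BATCH_SIZE = 4      # illustrative batch size

class AsyncRLSim:
    """Toy single-process simulation of the decoupled design: rollout_step()
    never waits for training, and train_step() fires as soon as enough
    sufficiently fresh samples have accumulated in the buffer."""

    def __init__(self):
        self.buffer = deque()  # each sample is tagged with its producing policy version
        self.version = 0       # trainer-side policy version
        self.dropped = 0       # samples discarded for exceeding MAX_STALENESS

    def rollout_step(self):
        # A rollout worker emits one sample under whatever weights it currently holds.
        self.buffer.append(self.version)

    def train_step(self):
        # Discard samples that lag the current policy by more than MAX_STALENESS.
        fresh = [v for v in self.buffer if self.version - v <= MAX_STALENESS]
        self.dropped += len(self.buffer) - len(fresh)
        self.buffer = deque(fresh)
        if len(self.buffer) < BATCH_SIZE:
            return False           # not enough fresh data yet
        for _ in range(BATCH_SIZE):
            self.buffer.popleft()  # consume one batch
        self.version += 1          # one policy update per consumed batch
        return True
```

In the real system the buffer filtering corresponds to the staleness-controllable load balancing between rollout and training workers; here it is collapsed into a single loop for clarity.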

📝 Abstract
Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before the model update, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.57× training speedup compared to the best synchronous systems with the same number of GPUs, with matched or even improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
Problem

Research questions and friction points this paper is trying to address.

How to build an efficient asynchronous RL training system for large language models
How to decouple generation from training so GPUs are not idle waiting on the longest rollout
How to keep RL training stable while learning from outdated (stale) samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

A fully asynchronous RL system that completely decouples generation from training
Dynamic workload balancing between rollout and training workers to bound data staleness
A staleness-enhanced PPO variant that remains robust to outdated training samples
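The summary does not spell out the staleness-enhanced PPO objective itself, so as background, here is the standard clipped PPO surrogate that such a variant builds on. With asynchronous rollouts the behavior log-probability was computed by an older policy version, so the importance ratio can drift far from 1, and bounding it limits how much any single stale sample can move an update. The clipping constant is an assumed typical value, not one taken from the paper.

```python
import math

CLIP_EPS = 0.2  # assumed typical PPO clipping range; not specified in the summary

def ppo_clip_loss(logp_new, logp_behavior, advantage):
    """Per-sample clipped PPO surrogate (negated, so lower is better).
    Under asynchrony, logp_behavior comes from an older policy version,
    so the ratio may be far from 1; the clip bounds its influence."""
    ratio = math.exp(logp_new - logp_behavior)
    clipped_ratio = max(min(ratio, 1.0 + CLIP_EPS), 1.0 - CLIP_EPS)
    # PPO takes the pessimistic (minimum) surrogate, then we negate for a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)
```

A staleness-enhanced variant would additionally account for how many policy versions separate the sample from the current model, as described in the paper.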
👥 Authors
Wei Fu (IIIS, Tsinghua University)
Jiaxuan Gao (Institute for Interdisciplinary Information Sciences, Tsinghua University)
Xujie Shen (Ant Research)
Chen Zhu (Ant Research)
Zhiyu Mei (Tsinghua University)
Chuyi He (Ant Group)
Shusheng Xu (IIIS, Tsinghua University)
Guo Wei (Ant Research)
Jun Mei (Ant Research)
Jiashu Wang (HKUST)
Tongkai Yang (Ant Research)
Binhang Yuan (HKUST)
Yi Wu (IIIS, Tsinghua University)