🤖 AI Summary
Current large language model (LLM) reinforcement learning (RL) systems predominantly adopt synchronous batched paradigms, where the rollout phase must wait for the longest sample to complete—resulting in low GPU utilization and suboptimal training efficiency. This work introduces the first fully asynchronous RL system tailored for language reasoning: it decouples rollout and training pipelines to enable continuous parallel execution; proposes a staleness-controllable dynamic load balancing mechanism; designs a staleness-aware PPO variant that enhances convergence robustness without compromising training stability; and integrates an efficient rollout caching scheme with optimized batch scheduling. Evaluated on mathematical and code reasoning benchmarks, the system achieves up to 2.57× training speedup while matching or exceeding the final performance of synchronous baselines under identical GPU resource constraints.
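The decoupled rollout/training pipeline described above can be pictured as a producer-consumer loop: rollout workers continuously generate samples tagged with the model version that produced them, and the trainer consumes batches while discarding samples whose version lag exceeds a staleness bound. The following is a minimal illustrative sketch under that reading, not AReaL's actual implementation; `MAX_STALENESS`, `BATCH_SIZE`, and all names are hypothetical:

```python
import queue
import random
import threading
import time

MAX_STALENESS = 2   # hypothetical bound on allowed version lag
BATCH_SIZE = 4

sample_queue = queue.Queue()
model_version = 0
stop = threading.Event()

def rollout_worker():
    # Generates outputs continuously, never waiting for the trainer.
    while not stop.is_set():
        version = model_version  # snapshot of the weights used for generation
        sample_queue.put({"version": version, "output": random.random()})
        time.sleep(0.001)  # stand-in for generation latency

def trainer(num_updates):
    global model_version
    for _ in range(num_updates):
        batch = []
        while len(batch) < BATCH_SIZE:
            s = sample_queue.get()
            # Staleness control: drop samples generated too many versions ago.
            if model_version - s["version"] <= MAX_STALENESS:
                batch.append(s)
        model_version += 1  # one policy update per collected batch

threading.Thread(target=rollout_worker, daemon=True).start()
trainer(5)
stop.set()
print(model_version)  # → 5
```

The key property the sketch illustrates is that generation never blocks on training: the queue absorbs rollouts while the trainer updates, and the staleness check keeps the effective off-policyness of each batch bounded.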
📝 Abstract
Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before the model update, resulting in GPU underutilization. We present AReaL, a *fully asynchronous* RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves **up to 2.57× training speedup** compared to the best synchronous systems with the same number of GPUs, with matched or even improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
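The abstract does not spell out the staleness-enhanced PPO objective. One common way to make PPO robust to outdated samples is a decoupled formulation that distinguishes the behavior policy (the stale rollout weights that generated the data) from the proximal policy used for clipping, reweighting each sample by the proximal-to-behavior likelihood ratio. A minimal NumPy sketch under that assumption follows; the function and argument names are illustrative, not AReaL's API:

```python
import numpy as np

def staleness_aware_ppo_loss(logp_new, logp_prox, logp_behav, adv, eps=0.2):
    """Decoupled-PPO-style loss: clip against the proximal policy,
    correct for stale rollouts via an importance weight.

    logp_new   -- log-probs under the current (trained) policy
    logp_prox  -- log-probs under the proximal policy (recent weights)
    logp_behav -- log-probs under the behavior policy (stale rollout weights)
    adv        -- advantage estimates
    """
    # Standard PPO ratio and clipping, taken against the proximal policy.
    ratio = np.exp(logp_new - logp_prox)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Off-policy correction for data generated by outdated weights.
    iw = np.exp(logp_prox - logp_behav)
    return -np.mean(iw * np.minimum(ratio * adv, clipped * adv))
```

When the data is fully on-policy (all three log-probability vectors coincide), the importance weight and ratio are both 1 and the loss reduces to the plain policy-gradient surrogate, so the correction only activates for stale samples.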