AI Summary
The reinforcement learning (RL) training procedures of state-of-the-art reasoning-oriented large language models (e.g., OpenAI o1, DeepSeek R1) remain undisclosed, hindering reproducibility and community advancement.
Method: We propose DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a large-scale RL algorithm for LLMs. Built on the verl framework, DAPO combines four key techniques: a decoupled (asymmetric) clipping range that raises the upper clip bound, dynamic sampling that filters out uninformative prompt groups, a token-level policy-gradient loss, and overlong reward shaping.
Contribution/Results: We fully open-source the training code, a high-quality dataset, and all hyperparameter configurations. On Qwen2.5-32B, DAPO achieves an AIME 2024 score of 50, the highest among publicly available RL-based LLM training methods to date. This work establishes a scalable, transparent, and reproducible foundation for RL training of large language models, significantly advancing both accessibility and extensibility in the field.
Abstract
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
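To make the two techniques named in the acronym concrete, here is a minimal NumPy sketch of (a) a PPO-style per-token loss with a decoupled clip range, where the upper bound `eps_high` is raised independently of the lower bound `eps_low`, and (b) a dynamic-sampling filter that drops prompt groups whose sampled responses are all correct or all wrong. The function names and epsilon values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def decoupled_clip_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token policy loss with a decoupled (asymmetric) clip range.

    Standard PPO clips the importance ratio to [1 - eps, 1 + eps];
    here the upper bound is controlled separately (eps_high > eps_low),
    so low-probability tokens can gain probability mass more easily.
    The eps values are illustrative, not the paper's exact settings.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic PPO objective, negated so the result is a loss.
    return -np.minimum(ratio * advantage, clipped * advantage)

def keep_prompt_group(rewards):
    """Dynamic-sampling filter over one prompt's sampled responses.

    With binary (0/1) correctness rewards, a group that is all correct
    or all wrong has zero group-normalized advantage and contributes no
    gradient, so such groups are dropped and resampled.
    """
    mean = np.mean(rewards)
    return 0.0 < mean < 1.0
```

For example, with `eps_high = 0.28` a ratio of 1.5 on a positive-advantage token is clipped to 1.28 rather than PPO's symmetric 1.2, and a group with rewards `[1, 1, 1, 1]` would be filtered out while `[1, 0, 1, 0]` is kept.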