AI Summary
The reinforcement learning (RL) training procedures of state-of-the-art reasoning-oriented large language models (e.g., OpenAI o1, DeepSeek R1) remain undisclosed, hindering reproducibility and community advancement.
Method: We propose DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a large-scale RL algorithm for LLMs. Built on the verl framework, DAPO combines four key techniques: a decoupled (asymmetric) clipping range that raises the upper clip bound, dynamic sampling that filters out uninformative prompt groups, a token-level policy-gradient loss, and overlong reward shaping.
Contribution/Results: We fully open-source the training code, a high-quality dataset, and all hyperparameter configurations. On Qwen2.5-32B, DAPO achieves an AIME 2024 score of 50, the highest among publicly available RL-based LLM training methods to date. This work establishes a scalable, transparent, and reproducible foundation for RL training of large language models, significantly advancing both accessibility and extensibility in the field.
Abstract
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
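To make the two techniques named in the acronym concrete, here is a minimal NumPy sketch of (a) a PPO-style per-token loss with a decoupled clip range, where the upper bound `eps_high` is raised independently of the lower bound `eps_low`, and (b) a dynamic-sampling filter that drops prompt groups whose sampled responses are all correct or all wrong. The function names and epsilon values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def decoupled_clip_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token policy loss with a decoupled (asymmetric) clip range.

    Standard PPO clips the importance ratio to [1 - eps, 1 + eps];
    here the upper bound is controlled separately (eps_high > eps_low),
    so low-probability tokens can gain probability mass more easily.
    The eps values are illustrative, not the paper's exact settings.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic PPO objective, negated so the result is a loss.
    return -np.minimum(ratio * advantage, clipped * advantage)

def keep_prompt_group(rewards):
    """Dynamic-sampling filter over one prompt's sampled responses.

    With binary (0/1) correctness rewards, a group that is all correct
    or all wrong has zero group-normalized advantage and contributes no
    gradient, so such groups are dropped and resampled.
    """
    mean = np.mean(rewards)
    return 0.0 < mean < 1.0
```

For example, with `eps_high = 0.28` a ratio of 1.5 on a positive-advantage token is clipped to 1.28 rather than PPO's symmetric 1.2, and a group with rewards `[1, 1, 1, 1]` would be filtered out while `[1, 0, 1, 0]` is kept.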