VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

📅 2025-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses three key challenges in value-based reinforcement learning for long chain-of-thought (long-CoT) reasoning: value model bias, heterogeneous sequence lengths, and reward sparsity. We propose VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework that introduces a reasoning-process-oriented value augmentation paradigm. VAPO integrates sequence-adaptive truncation and normalization, reward shaping, policy constraints, and gradient variance control. Built on the Qwen-32B pre-trained model, VAPO achieves exceptional training stability, with no crashes across runs, and converges within 5,000 steps. On the AIME 2024 benchmark, it attains a new state-of-the-art score of 60.4, outperforming DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. This advancement significantly enhances the efficiency and robustness of long-CoT training, marking a substantial step toward scalable and reliable reasoning optimization.
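The summary names value model bias and heterogeneous sequence lengths among the challenges VAPO mitigates. The paper's exact formulas are not reproduced here, but one plausible mechanism is Generalized Advantage Estimation (GAE) with a length-adaptive λ, so that longer rollouts lean more on Monte Carlo returns and less on a possibly biased value model. A minimal sketch, with `alpha` a hypothetical tuning constant:

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation: backward recursion over TD errors."""
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

def length_adaptive_lambda(seq_len, alpha=0.05):
    # Hypothetical schedule: lambda approaches 1 as sequences grow,
    # shifting long-CoT advantage estimates toward Monte Carlo returns
    # and away from the (possibly biased) value model's predictions.
    return 1.0 - 1.0 / (alpha * seq_len)
```

With `alpha=0.05`, a 100-token response uses λ = 0.8 while a 1,000-token response uses λ = 0.98, illustrating how bias-variance trade-offs can be tied to sequence length.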

📝 Abstract
We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency: it reaches state-of-the-art performance within a mere 5,000 steps, and across multiple independent runs no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, heterogeneous sequence lengths, and sparse reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Addresses value model bias in reinforcement learning.
Handles heterogeneous sequence lengths in reasoning tasks.
Mitigates sparse reward signals for stable training.
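One common remedy for the heterogeneous-length and gradient-variance issues listed above (not necessarily the paper's exact loss) is token-level averaging of the policy-gradient objective: summing over every token in the batch and dividing by the total token count, so long responses contribute in proportion to their length rather than each sequence weighing equally. A hypothetical sketch:

```python
def token_level_pg_loss(logps, advantages, lengths):
    """Token-level policy-gradient loss over a batch of variable-length
    sequences. Averaging over tokens (not sequences) keeps per-update
    gradient variance stable when response lengths differ widely."""
    total, n_tokens = 0.0, 0
    for seq_logps, seq_advs, length in zip(logps, advantages, lengths):
        for t in range(length):
            total += seq_logps[t] * seq_advs[t]
            n_tokens += 1
    return -total / max(n_tokens, 1)
```

Under sequence-level averaging, a one-token response and a thousand-token response would pull on the gradient equally; token-level averaging removes that imbalance.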
Innovation

Methods, ideas, or system contributions that make the work stand out.

Value-based Augmented Proximal Policy Optimization
Built on Qwen 32B pre-trained model
Addresses value model bias and reward sparsity