AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

career value

167K/year

🤖 AI Summary

Reinforcement learning (RL) for mathematical reasoning with large language models (LLMs) suffers from low sample efficiency due to scarcity of chain-of-thought data; existing Group Relative Policy Optimization (GRPO) methods exhibit vanishing gradients when advantage estimates approach zero, severely impairing convergence speed and training stability. Method: We propose Advantage-Augmented Policy Optimization (AAPO), a value-free policy optimization algorithm that introduces a novel momentum-based advantage augmentation mechanism within the group-relative estimation framework to mitigate gradient degeneration. AAPO directly optimizes the policy using cross-entropy loss, enhancing both stability and sample efficiency. Results: On multiple mathematical reasoning benchmarks, AAPO achieves an average 3.2% absolute improvement in reasoning accuracy over GRPO and PPO, while improving training sample efficiency by approximately 40%.

Technology Category

Application Category

📝 Abstract

Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that exsiting group relative advantage estimation method still suffers from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a momentum-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks demonstrate the superior performance of AAPO.

Problem

Research questions and friction points this paper is trying to address.

Improves LLM reasoning via reinforcement learning

Addresses inefficiencies in group advantage estimation

Introduces momentum-based advantage enhancement method

Innovation

Methods, ideas, or system contributions that make the work stand out.

AAPO enhances LLMs with advantage momentum

Group relative advantage estimation eliminates value model

Momentum-based scheme optimizes cross-entropy loss

🔎 Similar Papers

Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting