🤖 AI Summary
This work addresses the credit-assignment problem in group-relative reinforcement learning under sparse rewards: methods such as GRPO summarize each sampled group with scalar statistics and thereby lose fine-grained pairwise comparison information among candidate responses. We propose Lambda-Style Policy Optimization (LamPO), which replaces scalar group advantages with a pairwise decomposed advantage. Pairwise reward gaps within each response group are aggregated, and each comparison is weighted by a confidence-aware term computed from differences in sequence log-probabilities. LamPO retains the critic-free, clipped-update structure of PPO-style optimization. When a reference solution is available, it adds a lightweight ROUGE-L-based dense auxiliary reward to mitigate reward sparsity. In experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini, LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose \textbf{LamPO}, a \textbf{Lambda-Style Policy Optimization} method that replaces scalar group advantages with a \emph{Pairwise Decomposed Advantage}. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.
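To make the contrast with scalar group statistics concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not give the exact formulas, so several choices here are assumptions: the LambdaRank-like weight `sigmoid(-margin / temperature)`, the averaging over `n - 1` comparisons, the `temperature` and `alpha` parameters, whitespace tokenization, and the F1 form of ROUGE-L. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def grpo_advantages(rewards):
    """Scalar group-relative baseline: standardize rewards within the group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def pairwise_advantages(rewards, seq_logps, temperature=1.0):
    """ASSUMED form of a pairwise decomposed advantage.

    A_i = 1/(n-1) * sum_{j != i} w_ij * (r_i - r_j), where w_ij is large when
    the policy's sequence log-probabilities misorder the pair (i, j) relative
    to their rewards, and small when the pair is already ordered correctly.
    """
    n = len(rewards)
    if n < 2:
        return [0.0] * n
    advantages = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            gap = rewards[i] - rewards[j]
            if i == j or gap == 0.0:
                continue  # tied pairs carry no preference signal
            sign = 1.0 if gap > 0 else -1.0
            # Margin by which the policy already prefers the better response.
            margin = sign * (seq_logps[i] - seq_logps[j])
            total += sigmoid(-margin / temperature) * gap
        advantages.append(total / (n - 1))
    return advantages


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens, via longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    prev = [0] * (len(ref) + 1)
    for tok in cand:
        cur = [0]
        for k, ref_tok in enumerate(ref, start=1):
            cur.append(prev[k - 1] + 1 if tok == ref_tok else max(prev[k], cur[k - 1]))
        prev = cur
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def shaped_reward(outcome, candidate, reference=None, alpha=0.1):
    """Outcome reward plus an ASSUMED alpha-weighted ROUGE-L bonus."""
    if reference is None:
        return outcome
    return outcome + alpha * rouge_l_f1(candidate, reference)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]
    seq_logps = [-12.0, -8.0, -15.0, -7.5]
    print("GRPO     :", grpo_advantages(rewards))
    print("pairwise :", pairwise_advantages(rewards, seq_logps))
```

In this sketch the two correct responses receive identical GRPO advantages, whereas the pairwise form gives a larger advantage to the correct response that the policy currently assigns lower log-probability. Because the assumed weight is symmetric within each pair and the reward gap is antisymmetric, the advantages still sum to zero across the group.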