Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses optimization of the Pass@K objective (the probability that at least one of K sampled attempts succeeds) in reinforcement learning with verifiable rewards (RLVR), where reward signals are sparse and available only upon successful task completion. Method: The authors establish a fundamental equivalence between direct policy gradient methods (e.g., REINFORCE) and advantage shaping techniques by reinterpreting advantage shaping as implicit maximization of a surrogate reward. By reverse-engineering existing algorithms, including GRPO and reward-regularized variants, they show that these methods implicitly optimize the same class of surrogate rewards. Building on this insight, they develop a unified framework that systematically derives policy gradient algorithms from surrogate reward specifications. Contribution/Results: This work provides the first theoretical unification of Pass@K policy gradient methods under RLVR, yielding a general analytical paradigm and principled design guidelines for algorithm development in verifiable-reward settings.
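The core identity the summary describes can be sketched in a few lines. This is my own hedged reconstruction, not the paper's exact formulas: assuming binary per-sample rewards, a group-estimated success rate p, and the surrogate reward f(p) = 1 - (1 - p)^K, REINFORCE on f yields the gradient f'(p) · ∇p, which corresponds to reweighting the centered per-sample advantage (r_i - p) by f'(p). The function names (`grpo_advantages`, `passk_shaped_advantages`) are hypothetical, chosen for illustration.

```python
import math

def grpo_advantages(rewards):
    """Plain GRPO baseline: z-score each reward within its rollout group."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Fall back to 1.0 when all rewards in the group are identical (std == 0).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g) or 1.0
    return [(r - mean) / std for r in rewards]

def passk_shaped_advantages(rewards, k):
    """Advantage shaping read as surrogate reward maximization:
    REINFORCE on f(p) = 1 - (1 - p)**k has gradient f'(p) * grad(p),
    and grad(p) is estimated per sample by the centered term (r_i - p),
    so the shaped advantage is f'(p) * (r_i - p)."""
    p = sum(rewards) / len(rewards)   # group success rate (p-hat)
    w = k * (1.0 - p) ** (k - 1)      # f'(p): grows as p -> 0, up-weighting hard examples
    return [w * (r - p) for r in rewards]
```

Note the "hard-example up-weighting" the abstract mentions falls out directly: f'(p) = K(1 - p)^(K-1) is largest when the group success rate p is small.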

📝 Abstract
This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example up-weighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
Problem

Research questions and friction points this paper is trying to address.

Unifying advantage shaping and REINFORCE for Pass@K optimization
Revealing advantage shaping implicitly optimizes surrogate rewards
Providing recipe to derive new advantage-shaping methods from rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage shaping optimizes surrogate rewards implicitly
Reward regularization up-weights hard training examples
Unified framework derives new advantage-shaping methods
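The "recipe" idea can be illustrated generically. This is a hedged sketch under my own assumptions, not the paper's notation: pick any differentiable surrogate f of the group success rate p, and the induced shaped advantage is f'(p) · (r_i - p). The helper name `shaped_advantages_from_surrogate` is hypothetical; the derivative is taken by central finite difference for simplicity.

```python
def shaped_advantages_from_surrogate(rewards, f, eps=1e-6):
    """Generic recipe: given a differentiable surrogate f of the
    group success rate p, shape advantages as f'(p) * (r_i - p).
    f' is approximated here by a central finite difference."""
    p = sum(rewards) / len(rewards)
    fprime = (f(p + eps) - f(p - eps)) / (2 * eps)
    return [fprime * (r - p) for r in rewards]
```

Plugging in `f = lambda p: p` recovers the plain centered REINFORCE advantage (r_i - p), while `f = lambda p: 1 - (1 - p)**K` recovers a Pass@K-style shaping, which is the sense in which one recipe covers both existing and new methods.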