Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards

📅 2025-12-25

📈 Citations: 0

✨ Influential: 0

career value

188K/year

🤖 AI Summary

This work investigates how sample polarity—positive versus negative examples—affects training dynamics and behavioral evolution in Large Reasoning Models (LRMs) under Verifiable Reward Reinforcement Learning (RLVR): positive samples reinforce correct reasoning paths, while negative samples stimulate exploratory error correction. To this end, we propose A3PO—the first adaptive, asymmetric token-level advantage shaping method—that dynamically allocates advantage signals based on sample polarity, enabling polarity-aware policy gradient optimization. Evaluated on five major reasoning benchmarks, A3PO consistently outperforms baselines: it improves average reasoning accuracy by 4.2%, accelerates convergence of erroneous paths by 37%, and enhances generalization robustness. Our core contribution is the identification of the dual functional role of sample polarity in RLVR—both as a reinforcement signal and as a catalyst for corrective exploration—and the establishment of the first polarity-driven, token-level advantage modeling framework.

Technology Category

Application Category

📝 Abstract

Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.

Problem

Research questions and friction points this paper is trying to address.

Investigates how sample polarities affect RLVR training dynamics

Explores adjusting advantage values for positive and negative samples

Proposes A3PO method to allocate advantage signals to key tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Asymmetric token-level Advantage shaping

Precise allocation of advantage signals to key tokens

Adjusts advantage values for positive and negative samples

🔎 Similar Papers

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret