Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a challenge in reinforcement learning with sparse binary rewards: negative trajectory samples carry no fine-grained information about failure and provide only weak gradient signals. To overcome this limitation, the authors propose the POPO framework, which performs policy optimization using only positive trajectory samples. POPO generates negative gradients implicitly through bounded importance sampling and redistribution of positive-sample probabilities, improves training stability with a twin-policy network architecture and a bounded similarity penalty in representation space, and incorporates a momentum-adaptive mechanism. Evaluated with Qwen-Math-7B, the method achieves 36.67% accuracy on AIME 2025, compared with 30.00% for GRPO, and shows comparable or superior performance across multiple mathematical reasoning benchmarks.
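The idea that reinforcing positives can produce an implicit negative gradient can be illustrated with a toy softmax policy. The sketch below is not POPO's objective (the summary does not give it); it only shows the standard softmax identity that raising the log-probability of one outcome lowers the logits of all others. The function names `softmax` and `positive_only_step`, the three-outcome policy, and the learning rate are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def positive_only_step(logits, positive_index, lr=1.0):
    """One gradient-ascent step on log p[positive_index] (toy example).

    For a softmax policy, d log p[k] / d logit[j] = 1[j == k] - p[j],
    so every non-positive logit receives a negative update even though
    no negative sample appears in the objective.
    """
    probs = softmax(logits)
    grad = [(1.0 if j == positive_index else 0.0) - p
            for j, p in enumerate(probs)]
    return [x + lr * g for x, g in zip(logits, grad)]

# Hypothetical three-outcome policy; outcome 0 is the only positive.
logits = [0.0, 0.0, 0.0]
before = softmax(logits)
after = softmax(positive_only_step(logits, positive_index=0))
```

After the step, the probability of outcome 0 rises and the probabilities of outcomes 1 and 2 fall, because the distribution must still sum to one.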

📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs), owing to its deterministic verification. The community has moved rapidly from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that under sparse binary rewards, negative rollouts admit no gradation of failure severity, and the space of failures is combinatorially vast, so penalizing a few sampled negatives is unlikely to provide a meaningful reward signal. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning occurs exclusively via online positive rollouts. Specifically, POPO applies bounded importance sampling over the positive rollout set, so no disjoint negative rollouts are used for gradient guidance. We show that implicit negative gradients emerge naturally when the probability of positive rollouts is reinforced, through rollout redistribution. POPO further stabilizes policy optimization through two mechanisms. First, it uses a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, it replaces the KL divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text LLMs, e.g., the Qwen family, across mathematical benchmarks of all difficulty levels. Our experiments demonstrate that POPO achieves performance comparable or even superior to GRPO. Notably, POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO (30.00%). Our ablation and sweep studies further illustrate the necessity and robustness of the POPO components.
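The two stabilization mechanisms can be sketched in a few lines. The abstract does not give their exact form, so everything here is an assumption: the momentum-based adaptation is approximated by a fixed-coefficient exponential moving average (the paper's adaptation law may well vary the coefficient), and the bounded similarity penalty is approximated by one minus cosine similarity, which lies in [0, 2]. The names `ema_update` and `bounded_similarity_penalty` are hypothetical.

```python
import math

def ema_update(target, online, momentum=0.99):
    """Move the twin (target) parameters toward the online parameters.

    Assumption: a plain exponential moving average with a fixed
    momentum coefficient, used here only as a stand-in.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target, online)]

def bounded_similarity_penalty(u, v, eps=1e-12):
    """Return 1 - cos(u, v) for two representation vectors.

    Assumption: cosine distance as the bounded penalty. Unlike a KL
    term, its value always stays within [0, 2].
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / max(norm_u * norm_v, eps)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding drift
    return 1.0 - cos

# Toy usage with made-up two-dimensional vectors.
twin_params = ema_update([0.0, 0.0], [1.0, 1.0], momentum=0.9)
penalty_same = bounded_similarity_penalty([1.0, 2.0], [1.0, 2.0])
penalty_opposite = bounded_similarity_penalty([1.0, 2.0], [-1.0, -2.0])
```

A bounded penalty of this kind cannot blow up when the two branches diverge, which is one plausible reason to prefer it over an unbounded divergence.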

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards

Sparse Binary Rewards

Negative Rollouts

Policy Optimization

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Positive-Only Policy Optimization

Implicit Negative Gradients

Bounded Importance Sampling