🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), negative sample groups—those containing no correct responses—yield zero advantages and thus provide no gradient signal, leading to inefficient training.
Method: This paper proposes LENS (Likelihood Estimation with Negative Samples), which reformulates the gradient of the maximum-likelihood objective as a confidence-penalized policy gradient. By reweighting rewards according to model confidence, LENS imposes stronger penalties on high-confidence incorrect outputs, enabling informative gradient updates even from all-negative groups. Integrated into the GRPO framework, LENS reuses negative samples without additional supervision, improving gradient utilization efficiency.
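To make the mechanism concrete, here is a minimal sketch contrasting standard GRPO advantage normalization with a confidence-weighted reweighting of incorrect samples. The `lens_rewards` function and the specific penalty `-confidence` are illustrative assumptions, not the paper's exact formula:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    # Standard GRPO: group-normalized advantages. A negative group
    # (all rewards identical, e.g. all zero) has zero std, so every
    # advantage is zero and the group contributes no gradient.
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # wasted group under vanilla GRPO
    return [(r - mu) / sigma for r in rewards]

def lens_rewards(correct, confidences):
    # Hypothetical LENS-style reweighting (a sketch under assumptions):
    # an incorrect response is penalized in proportion to the model's
    # confidence in it, so confident mistakes are penalized hardest.
    return [1.0 if c else -p for c, p in zip(correct, confidences)]

# An all-negative group: no correct responses, varying model confidence.
correct = [False, False, False, False]
conf = [0.9, 0.5, 0.2, 0.1]  # illustrative sequence likelihoods

plain = grpo_advantages([0.0] * 4)                    # all zeros
lens = grpo_advantages(lens_rewards(correct, conf))   # non-zero signal
```

With confidence-dependent rewards, the group's advantages are no longer degenerate: the most confident wrong answer receives the most negative advantage, which is the informative update the summary describes.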
Results: Evaluated on Llama and Qwen models, LENS significantly outperforms the GRPO baseline on the MATH benchmark—particularly on challenging problems—demonstrating simultaneous gains in both training efficiency and downstream reasoning performance.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as **L**ikelihood **E**stimation with **N**egative **S**amples (**LENS**). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.