Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning with verifiable rewards (RLVR), negative sample groups (those containing no correct responses) yield zero advantage and thus no gradient signal, wasting training compute. Method: This paper proposes LENS (Likelihood Estimation with Negative Samples), which reformulates the maximum-likelihood objective's gradient as a confidence-penalized policy gradient. By reweighting rewards according to model confidence, LENS imposes stronger penalties on high-confidence incorrect outputs, enabling informative gradient updates even from negative groups. Integrated into the GRPO framework, LENS reuses negative samples without additional supervision, improving gradient utilization. Results: Evaluated on Llama and Qwen models, LENS consistently outperforms the GRPO baseline on the MATH benchmark, particularly on challenging problems, demonstrating gains in both training efficiency and downstream reasoning performance.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as **L**ikelihood **E**stimation with **N**egative **S**amples (**LENS**). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
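The reward reweighting described in the abstract can be sketched as follows. This is an illustrative approximation, not the paper's exact formulation: the function names are made up, and "confidence" is assumed here to be the exponentiated mean token log-probability of the sampled response.

```python
import numpy as np

def lens_rewards(is_correct, mean_token_logprobs):
    """Sketch of LENS-style confidence-penalized rewards (hypothetical).

    Correct responses keep a reward of 1. Incorrect responses receive a
    negative reward proportional to the model's confidence in them, so
    more confident mistakes are penalized more strongly.
    """
    confidence = np.exp(mean_token_logprobs)  # in (0, 1]
    return np.where(is_correct, 1.0, -confidence)

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: standardize within the group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

With binary 0/1 rewards, an all-incorrect group produces identical rewards and hence zero advantages; under the confidence-dependent rewards above, the same group yields non-zero, informative advantages.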
Problem

Research questions and friction points this paper is trying to address.

Leveraging negative RL groups via confidence reweighting to improve efficiency
Converting wasted negative samples into useful gradient updates
Improving reinforcement learning performance on reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LENS reweights negative samples using confidence penalties
Modifies GRPO to assign non-zero rewards to incorrect responses
Converts wasted negative groups into useful gradient updates
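The bullets above can be illustrated with a tiny numeric demo. All values are made up; the point is only to contrast vanilla GRPO's vanishing advantages on a negative group with a LENS-style confidence penalty that keeps the gradient signal alive.

```python
import numpy as np

# A "negative group": all 4 sampled responses are wrong. Vanilla GRPO's
# binary rewards are all 0, so the group-centered advantages are all zero
# and the group contributes no gradient.
binary_rewards = np.zeros(4)
vanilla_adv = binary_rewards - binary_rewards.mean()  # all zeros

# A LENS-style scheme instead penalizes by model confidence (illustrative
# per-response probabilities), so advantages within the group differ:
confidence = np.array([0.9, 0.5, 0.2, 0.1])  # made-up values
lens_rewards = -confidence
lens_adv = (lens_rewards - lens_rewards.mean()) / (lens_rewards.std() + 1e-8)
# The most confident wrong answer gets the most negative advantage.
```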