Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

📅 2025-10-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses reinforcement learning with verifiable rewards under unreliable verifiers—those exhibiting both false positives and false negatives. We model the verifier as a noisy stochastic channel and propose two gradient-correction paradigms: (1) a forward correction that aligns the gradient direction with the unbiased oracle gradient using only an estimate of the false-negative rate, augmented by a lightweight LLM-based online estimator; and (2) a backward correction that combines score-function reweighting with reverse debiasing, embedded within the GRPO framework for robust training. Our approach significantly outperforms baselines on mathematical reasoning tasks: forward correction achieves faster convergence and superior noise resilience, while both variants attain state-of-the-art performance across multiple models and datasets.
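The summary's appeal mechanism—estimating the false-negative rate online by rechecking rule-based negatives with a second verifier—can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `recheck` function here is a hypothetical stand-in (an exact-arithmetic equivalence test) for the lightweight LLM verifier.

```python
from fractions import Fraction

def estimate_fn_rate(rejected_answers, recheck):
    # Fraction of rule-based negatives that a second verifier overturns,
    # used as an online estimate of the false-negative (FN) rate.
    if not rejected_answers:
        return 0.0
    overturned = sum(1 for a in rejected_answers if recheck(a))
    return overturned / len(rejected_answers)

# Hypothetical recheck: semantic equivalence against the gold answer,
# standing in for the LLM-based rechecker described in the paper.
gold = Fraction(1, 3)
recheck = lambda ans: Fraction(ans) == gold

# "12/36" was rejected by a brittle string match but equals 1/3,
# so half of these negatives are overturned on appeal.
fn_hat = estimate_fn_rate(["12/36", "5/7"], recheck)  # -> 0.5
```

In a training loop, this estimate would be refreshed periodically from the stream of rejected samples and fed to the correction that needs only the FN rate.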

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces *false negatives* (rejecting correct answers, FNs) and *false positives* (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction $\frac{12}{36}$ as wrong when compared against the canonical $\frac{1}{3}$ due to brittle parsing/equivalence rules (FN), while large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a *backward* correction that de-biases the observed binary reward to recover an *unbiased* estimator of the clean policy gradient. The second is a *forward* correction that reweights score-function terms so that the expected update direction aligns with the *clean gradient*; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, outperforming other state-of-the-art contenders.
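The backward correction described in the abstract admits a simple closed form for binary rewards under a known noise channel. The sketch below is an illustration of that standard debiasing identity, not the paper's exact code: with FN rate `fn_rate` (a true 1 observed as 0) and FP rate `fp_rate` (a true 0 observed as 1), the affine map below is an unbiased estimator of the clean reward.

```python
def debias_reward(r_obs, fn_rate, fp_rate):
    """Backward correction: map the observed binary reward to an
    unbiased estimate of the clean reward.

    Channel model: a true 1 flips to 0 with prob fn_rate, and a true 0
    flips to 1 with prob fp_rate (requires fn_rate + fp_rate < 1).
    """
    return (r_obs - fp_rate) / (1.0 - fn_rate - fp_rate)

# Unbiasedness check, taking expectations over the noise channel.
fn, fp = 0.2, 0.1
# E[debias | true reward = 1] = (1-fn)*debias(1) + fn*debias(0) = 1
e1 = (1 - fn) * debias_reward(1, fn, fp) + fn * debias_reward(0, fn, fp)
# E[debias | true reward = 0] = fp*debias(1) + (1-fp)*debias(0) = 0
e0 = fp * debias_reward(1, fn, fp) + (1 - fp) * debias_reward(0, fn, fp)
```

Plugging the debiased reward into the score-function estimator recovers, in expectation, the clean policy gradient; the abstract's forward correction instead reweights the score-function terms directly and needs only `fn_rate`.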
Problem

Research questions and friction points this paper is trying to address.

Addresses unreliable automated verifiers producing false positives and negatives
Corrects biased policy gradients caused by noisy binary reward signals
Improves reinforcement learning stability under asymmetric verifier error rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modeling verifier as stochastic reward channel with asymmetric noise
Developing backward correction to debias binary reward signals
Implementing forward correction using only false negative rate
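The first innovation—modeling the verifier as a stochastic reward channel with asymmetric noise—can be made concrete with a small simulation. This is an illustrative sketch under assumed noise rates, not the paper's experimental setup: a naive average of observed rewards is biased by the channel, while the backward-corrected rewards recover the clean accuracy in expectation.

```python
import random

def noisy_verifier(true_reward, fn_rate, fp_rate, rng):
    # Stochastic reward channel: flips 1 -> 0 w.p. fn_rate (FN)
    # and 0 -> 1 w.p. fp_rate (FP).
    if true_reward == 1:
        return 0 if rng.random() < fn_rate else 1
    return 1 if rng.random() < fp_rate else 0

rng = random.Random(0)
fn, fp = 0.3, 0.1                       # assumed asymmetric noise rates
true_rewards = [1] * 600 + [0] * 400    # clean accuracy: 0.60

observed = [noisy_verifier(r, fn, fp, rng) for r in true_rewards]
corrected = [(r - fp) / (1 - fn - fp) for r in observed]

naive_mean = sum(observed) / len(observed)       # biased estimate
debiased_mean = sum(corrected) / len(corrected)  # approx. recovers 0.60
```

The same per-sample correction, applied before the group-relative advantage computation, is the kind of lightweight hook the paper embeds in a GRPO pipeline.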