🤖 AI Summary
This work challenges the validity of reported performance gains from Reinforcement Learning with Verifiable Rewards (RLVR) on structured tasks such as mathematical reasoning and code generation, identifying systematic overestimation driven by evaluation bias, training-data contamination, and the "RLVR tax": the implicit computational and accuracy cost imposed by verification overhead. To address this, the authors propose a tax-aware training and evaluation protocol that integrates controlled comparative evaluation, provenance-based verification, calibration-aware abstention, and same-budget baseline reproduction, jointly optimizing for accuracy, factual consistency, and principled refusal. Under these strict, parity-controlled conditions, many previously claimed improvements shrink substantially or vanish, prompting the revision of several mainstream conclusions. The study offers the first quantitative characterization of the RLVR tax and establishes a reproducible, auditable evaluation framework, with concrete mitigation strategies, for industrial-grade trustworthy reinforcement learning.
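The "calibration-aware abstention" component can be pictured as a thresholded answer-or-refuse loop scored so that confident errors are penalized, which keeps guessing from inflating the metric. The sketch below is a minimal illustration: the `answer_with_confidence` interface, the 0.75 confidence threshold, and the unit penalty for wrong answers are all assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of calibration-aware abstention with a tax-aware score.
# `model.answer_with_confidence` is an assumed interface, not from the paper.

from dataclasses import dataclass

@dataclass
class TaxAwareResult:
    accuracy: float         # fraction of attempted answers that are correct
    abstention_rate: float  # fraction of prompts the model declined
    tax_aware_score: float  # correct answers minus a penalty for confident errors

def evaluate_tax_aware(model, problems, conf_threshold=0.75, wrong_penalty=1.0):
    """Answer only when confidence clears the threshold; abstain otherwise.

    The score rewards correct answers, ignores abstentions, and penalizes
    confident mistakes, so pure guessing cannot inflate the metric.
    """
    correct = wrong = abstained = 0
    for problem in problems:
        answer, confidence = model.answer_with_confidence(problem.prompt)
        if confidence < conf_threshold:
            abstained += 1  # calibrated refusal: no reward, no penalty
        elif answer == problem.reference:
            correct += 1
        else:
            wrong += 1
    attempted = correct + wrong
    accuracy = correct / attempted if attempted else 0.0
    score = (correct - wrong_penalty * wrong) / len(problems)
    return TaxAwareResult(accuracy, abstained / len(problems), score)
```

Under this kind of scoring, a model that abstains on problems it cannot solve outscores one that answers everything with the same raw accuracy, which is what "jointly optimizing for accuracy and principled refusal" amounts to in practice.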
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) is a practical and scalable approach to enhancing large language models on math, code, and other structured tasks. Two questions motivate this paper: how much of the reported gains survives strictly parity-controlled evaluation, and whether RLVR is cost-free or exacts a measurable tax. We argue that progress is real but that gains are often overstated due to three forces: an RLVR tax, evaluation pitfalls, and data contamination. Using a partial-prompt contamination audit and matched-budget reproductions across base and RL models, we show that several headline gaps shrink or vanish under clean, parity-controlled evaluation. We then propose a tax-aware training and evaluation protocol that co-optimizes accuracy, grounding, and calibrated abstention while standardizing budgeting and provenance checks. Applied to recent RLVR setups, this protocol yields more reliable estimates of reasoning gains and, in several cases, revises prior conclusions. Our position is constructive: RLVR is valuable and industry-ready, and we advocate keeping its practical benefits while prioritizing reliability, safety, and sound measurement.
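One way to picture the partial-prompt contamination audit: show a model only a prefix of each benchmark prompt and check whether it reproduces the withheld remainder nearly verbatim, which suggests the item was seen during training. The sketch below assumes a Hugging Face causal LM; the 50% prefix split and 0.8 similarity threshold are illustrative choices, not the paper's exact procedure.

```python
# Minimal sketch of a partial-prompt contamination audit, assuming a
# Hugging Face causal LM. The prefix fraction and similarity threshold
# are illustrative assumptions, not the paper's exact settings.

from difflib import SequenceMatcher
from transformers import AutoModelForCausalLM, AutoTokenizer

def contamination_audit(model_name, prompts, prefix_frac=0.5, sim_threshold=0.8):
    """Show the model only a prefix of each benchmark prompt and flag
    near-verbatim continuation of the withheld suffix as likely memorization."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    flagged = 0
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").input_ids[0]
        cut = int(len(ids) * prefix_frac)
        prefix, held_out = ids[:cut], ids[cut:]
        # Greedy decoding: memorized text tends to surface without sampling.
        out = model.generate(prefix.unsqueeze(0),
                             max_new_tokens=len(held_out), do_sample=False)
        completion = tok.decode(out[0][cut:], skip_special_tokens=True)
        reference = tok.decode(held_out, skip_special_tokens=True)
        if SequenceMatcher(None, completion, reference).ratio() >= sim_threshold:
            flagged += 1  # near-verbatim suffix: likely in the training data
    return flagged / len(prompts)
```

Running the same audit on the base model and the RL-tuned model over a benchmark's prompts gives a per-model contamination rate, which is what makes matched-budget comparisons interpretable: a "gain" concentrated on flagged items is evidence of leakage rather than improved reasoning.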