Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards

📅 2026-01-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the fundamental impact of noisy yet verifiable rewards on learning dynamics in reinforcement learning: does such noise merely slow convergence, or can it drive learning to incorrect solutions? Modeling the problem as a multi-armed bandit and coupling replicator dynamics with the GRPO algorithm, the authors run controlled synthetic-noise experiments and propose a phase-transition criterion based on the Youden index \( J \). The work gives the first theoretical characterization of the decisive role of noise in determining final performance: when \( J > 0 \), the system invariably converges to the correct solution and noise affects only the convergence rate; when \( J = 0 \), learning is neutral; and when \( J < 0 \), learning collapses and erroneous behaviors come to dominate. The criterion thus cleanly delineates three dynamical regimes: learning, neutral, and anti-learning.
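The phase criterion itself is simple to state in code. A minimal sketch, assuming only the paper's definition J = TPR - FPR (the function and regime names here are illustrative, not from the paper's implementation):

```python
def youden_index(tpr: float, fpr: float) -> float:
    """Youden's J = TPR - FPR: the sign of the drift on the incorrect-mode mass."""
    return tpr - fpr

def regime(tpr: float, fpr: float) -> str:
    """Classify the dynamical regime predicted by the phase-transition criterion."""
    j = youden_index(tpr, fpr)
    if j > 0:
        return "learning"       # incorrect modes die out; noise only slows convergence
    if j < 0:
        return "anti-learning"  # incorrect modes amplify until they dominate
    return "neutral"            # incorrect mass performs an unbiased drift

print(regime(0.9, 0.2))  # → learning
print(regime(0.2, 0.9))  # → anti-learning
```

Note that the criterion depends only on the verifier's operating point (TPR, FPR), not on the absolute noise level, which is why noise in the J > 0 regime changes rate but not fate.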

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean: unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited. The problem worsens on harder domains (especially coding), where tests are sparse and increasingly model-generated. We ask a pragmatic question: does verification noise merely slow down learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouple into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden's index J = TPR - FPR. This yields a sharp phase transition: when J > 0, the incorrect mass is driven toward extinction (learning); when J = 0, the process is neutral; and when J < 0, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime (J > 0), noise primarily rescales convergence time ("rate, not fate"). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J = 0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.
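The one-dimensional evolution of the incorrect mass can be illustrated with a toy replicator simulation. This is a minimal sketch, not the paper's experimental setup: it assumes a two-mode bandit in which a correct completion receives expected verifier reward TPR and an incorrect one receives FPR, and it integrates the standard replicator equation dp_i/dt = p_i (f_i - f̄) with Euler steps.

```python
import numpy as np

def replicator_step(p: np.ndarray, fitness: np.ndarray, dt: float = 0.1) -> np.ndarray:
    """One Euler step of replicator dynamics dp_i = p_i * (f_i - f_bar) * dt."""
    f_bar = p @ fitness                      # population-average fitness
    p = p + dt * p * (fitness - f_bar)       # selection flow on the simplex
    return p / p.sum()                       # renormalize against Euler drift

def incorrect_mass(tpr: float, fpr: float, steps: int = 500) -> float:
    """Simulate two modes (0 = correct, 1 = incorrect) from a 50/50 start
    and return the surviving incorrect-mode mass."""
    p = np.array([0.5, 0.5])
    fitness = np.array([tpr, fpr])           # expected noisy verifier reward per mode
    for _ in range(steps):
        p = replicator_step(p, fitness)
    return float(p[1])

print(incorrect_mass(tpr=0.8, fpr=0.3))  # J > 0: incorrect mass near 0 (learning)
print(incorrect_mass(tpr=0.3, fpr=0.8))  # J < 0: incorrect mass near 1 (collapse)
print(incorrect_mass(tpr=0.5, fpr=0.5))  # J = 0: neutral, mass stays at 0.5
```

Because the fitness gap between the two modes is exactly J = TPR - FPR, the log-odds of the incorrect mass drift linearly at rate J, reproducing the sharp boundary at J = 0 that the controlled experiments confirm.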
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
noisy rewards
verifiable rewards
learning dynamics
phase transition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning with Verifiable Rewards
Youden's Index
Multi-armed Bandit
Noise Robustness
Replicator Dynamics