🤖 AI Summary
This study investigates the impact of noisy verifiers on reinforcement learning from verifiable rewards (RLVR), a setting where the influence of verification errors remains poorly understood. Through systematic experiments on code generation and scientific reasoning tasks, the authors introduce various types of reward noise and empirically demonstrate that RLVR remains highly effective even with verifier noise rates as high as 15%, provided the verifier maintains moderate accuracy and high precision. Across multiple prominent model families, including Qwen3, GLM4, and Llama 3.1, peak validation accuracy degrades by less than two percentage points relative to a noise-free baseline. These findings hold consistently under diverse experimental configurations, revealing the robustness of RLVR to imperfections in verifier reliability.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent method for post-training Large Language Models (LLMs). However, verifiers are rarely error-free; even deterministic checks can be inaccurate, and the growing dependence on model-based judges exacerbates the issue. The extent to which RLVR is robust to such noise and the verifier accuracy required for effective training remain unresolved questions. We investigate these questions in the domains of code generation and scientific reasoning by introducing noise into RL training. Noise rates up to 15% yield peak validation accuracy within 2 percentage points of the clean baseline. These findings are consistent across controlled and model-based noise types, three model families (Qwen3, GLM4, Llama 3.1), and model sizes from 4B to 9B. Overall, the results indicate that imperfect verification does not constitute a fundamental barrier to RLVR. Furthermore, our findings suggest that practitioners should prioritize moderate accuracy with high precision over perfect verification.
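The noise-injection setup described in the abstract can be sketched as a thin wrapper around a binary verifier. This is a hypothetical illustration of the general technique, not the authors' implementation; the function and parameter names (`noisy_verifier`, `fp_rate`, `fn_rate`) are assumptions, as is the choice to split the noise rate into false-positive and false-negative components:

```python
import random

def noisy_verifier(true_reward: int, fp_rate: float, fn_rate: float,
                   rng: random.Random) -> int:
    """Corrupt a binary verifiable reward to simulate an imperfect verifier.

    A correct solution (reward 1) is rejected with probability fn_rate
    (false negative, hurting the verifier's recall); an incorrect one
    (reward 0) is accepted with probability fp_rate (false positive,
    hurting its precision).
    """
    if true_reward == 1:
        return 0 if rng.random() < fn_rate else 1
    return 1 if rng.random() < fp_rate else 0

# Example: inject the paper's highest studied noise rate (15%) into
# rewards for solutions that are actually correct.
rng = random.Random(0)
noisy = [noisy_verifier(1, fp_rate=0.15, fn_rate=0.15, rng=rng)
         for _ in range(10_000)]
empirical_fn_rate = 1 - sum(noisy) / len(noisy)
```

During RL training such a wrapper would sit between the ground-truth check and the policy update, so the policy only ever sees the corrupted reward signal.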