Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

📅 2026-03-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study re-examines how noisy data affects reinforcement learning with verifiable rewards (RLVR) and disputes earlier claims that improved RLVR algorithms are robust to label noise. The authors show that the "fully noisy" training sets used in prior experiments inadvertently included clean samples, which confounded conclusions about algorithmic robustness. After cleansing the data with a rigorous re-verification pipeline, they find that existing RLVR algorithm improvements perform about the same as basic GRPO under noise, and that a model trained on truly incorrect annotations scores 8-10% lower than one trained on clean data across mathematical reasoning benchmarks. The effect carries over to real-world noise: in Text-to-SQL, training on human annotation errors lowers accuracy by 5-12% relative to clean data.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is "contaminated" with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving performance similar to that of basic GRPO. Furthermore, we find that the model trained on truly incorrect annotations performs 8-10% worse than the model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world human annotation errors causes 5-12% lower accuracy than training on clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.
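To illustrate why an incorrect annotation can be harmful rather than merely uninformative, here is a minimal sketch of a binary verifiable reward combined with GRPO-style group-relative advantages. It is not the paper's code: the function names, the exact-match verifier, the toy sample group, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, label: str) -> float:
    """Hypothetical exact-match verifier: 1.0 if the answer equals the label."""
    return 1.0 if answer.strip() == label.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center on the group mean, scale by group std.

    Assumption: population std and a small eps; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four sampled answers to one prompt; the true answer is "42".
samples = ["42", "42", "41", "42"]

# Clean label: the correct samples receive positive advantages.
clean_adv = group_relative_advantages(
    [verifiable_reward(s, "42") for s in samples]
)

# Noisy label "41": the same samples are now scored against a wrong reference.
noisy_adv = group_relative_advantages(
    [verifiable_reward(s, "41") for s in samples]
)

for s, c, n in zip(samples, clean_adv, noisy_adv):
    print(f"answer={s}  clean_adv={c:+.3f}  noisy_adv={n:+.3f}")
```

In this toy group the wrong label flips the sign of every advantage, so the policy update would push probability away from the correct answer and toward the incorrect one.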
Problem

Research questions and friction points this paper is trying to address.

noisy data
reinforcement learning
verifiable rewards
data quality
annotation errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning with Verifiable Rewards
Noisy Data
Data Verification
GRPO
Annotation Errors