🤖 AI Summary
Generative reward models (GRMs), or “LLMs-as-judges,” are highly susceptible to non-semantic surface features—such as irrelevant tokens or specific reasoning prefixes—leading to erroneous quality assessments and undermining the reliability of downstream algorithms like rejection sampling, preference optimization, and RLVR. This work is the first to systematically expose the pervasive fragility of GRMs: even a single irrelevant token can significantly distort judgment outcomes. To address this, we propose a lightweight synthetic data augmentation strategy that injects diverse surface perturbations during training to enhance robustness. Extensive evaluation across multiple LLMs, benchmark datasets, and prompt templates demonstrates that our general-purpose reward model, Master-RM, achieves substantially lower misjudgment rates and strong cross-dataset and cross-prompt generalization. All models and synthetic data are fully open-sourced to facilitate reproducibility and further research.
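The augmentation idea described above can be sketched in a few lines: pair each training prompt with content-free "hacking" responses (lone punctuation, generic reasoning openers) explicitly labeled as incorrect, so the judge learns to reject surface patterns rather than reward them. This is a minimal illustrative sketch, not the paper's actual pipeline; the example schema (`question`/`reference` keys) and the `augment_with_hacks` helper are assumptions for illustration.

```python
# Hedged sketch of surface-perturbation data augmentation for training a
# robust generative reward model. The perturbation strings mirror the
# examples given in the abstract; the data schema is hypothetical.

HACK_RESPONSES = [
    ":",
    ".",
    "Thought process:",
    "Let's solve this problem step by step.",
]

def augment_with_hacks(example):
    """Given one training example (dict with 'question' and 'reference'),
    emit negative examples whose candidate answers are content-free
    perturbations, each labeled incorrect."""
    return [
        {
            "question": example["question"],
            "reference": example["reference"],
            "candidate": hack,
            "label": "incorrect",  # judge should assign zero reward
        }
        for hack in HACK_RESPONSES
    ]
```

Mixing such negatives into the judge's training set is what exposes it to the superficial cues it would otherwise mistake for valid reasoning.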
📝 Abstract
Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the apparent simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often elicit false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
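The judging paradigm the abstract describes can be sketched as follows: build a prompt that places the question, the ground-truth reference, and the candidate side by side, then map the judge's free-form verdict to a binary reward. The template wording and the `parse_binary_reward` helper are illustrative assumptions, not the paper's exact prompt; a real system would route the prompt through an actual LLM call.

```python
# Minimal sketch of the LLM-as-judge binary reward protocol.
# The template text is a hypothetical stand-in for the paper's prompts.

JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct? Reply YES or NO."
)

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble the comparison prompt sent to the judge model."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )

def parse_binary_reward(judge_output: str) -> int:
    """Collapse the judge's verdict to a binary reward: 1 if it starts
    with YES (case-insensitive), else 0."""
    return 1 if judge_output.strip().upper().startswith("YES") else 0
```

The fragility the paper exposes lives in the model behind this prompt: a candidate consisting only of ":" or "Thought process:" should yield reward 0, yet unhardened judges frequently answer YES.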