One Token to Fool LLM-as-a-Judge

📅 2025-07-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Generative reward models (GRMs), or “LLMs-as-judges,” are highly susceptible to non-semantic surface features—such as irrelevant tokens or specific reasoning prefixes—leading to erroneous quality assessments and undermining the reliability of downstream algorithms like rejection sampling, preference optimization, and RLVR. This work is the first to systematically expose the pervasive fragility of GRMs: even a single irrelevant token can significantly distort judgment outcomes. To address this, we propose a lightweight synthetic data augmentation strategy that injects diverse surface perturbations during training to enhance robustness. Extensive evaluation across multiple LLMs, benchmark datasets, and prompt templates demonstrates that our general-purpose reward model, Master-RM, achieves substantially lower misjudgment rates and strong cross-dataset and cross-prompt generalization. All models and synthetic data are fully open-sourced to facilitate reproducibility and further research.

Technology Category

Application Category

📝 Abstract
Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
Problem

Research questions and friction points this paper is trying to address.

Generative reward models are vulnerable to superficial manipulations
Non-word symbols and reasoning openers cause false positive rewards
Current LLM-based evaluation methods lack reliability and robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data augmentation for robust reward models
Training generative reward models for reliability
Mitigating vulnerabilities in LLM-as-a-Judge systems
🔎 Similar Papers
No similar papers found.
Yulai Zhao
Yulai Zhao
Princeton University
Reinforcement LearningML for Science
H
Haolin Liu
Tencent AI Lab
D
Dian Yu
Tencent AI Lab
S
S. Y. Kung
Princeton University
Haitao Mi
Haitao Mi
Principal Researcher, Tencent US
Large Language Models
D
Dong Yu
Tencent AI Lab