One Token to Fool LLM-as-a-Judge

📅 2025-07-11

📈 Citations: 0

✨ Influential: 0

career value

189K/year

🤖 AI Summary

Generative reward models (GRMs), or “LLMs-as-judges,” are highly susceptible to non-semantic surface features—such as irrelevant tokens or specific reasoning prefixes—leading to erroneous quality assessments and undermining the reliability of downstream algorithms like rejection sampling, preference optimization, and RLVR. This work is the first to systematically expose the pervasive fragility of GRMs: even a single irrelevant token can significantly distort judgment outcomes. To address this, we propose a lightweight synthetic data augmentation strategy that injects diverse surface perturbations during training to enhance robustness. Extensive evaluation across multiple LLMs, benchmark datasets, and prompt templates demonstrates that our general-purpose reward model, Master-RM, achieves substantially lower misjudgment rates and strong cross-dataset and cross-prompt generalization. All models and synthetic data are fully open-sourced to facilitate reproducibility and further research.

Technology Category

Application Category

📝 Abstract

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

Problem

Research questions and friction points this paper is trying to address.

Generative reward models are vulnerable to superficial manipulations

Non-word symbols and reasoning openers cause false positive rewards

Current LLM-based evaluation methods lack reliability and robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data augmentation for robust reward models

Training generative reward models for reliability

Mitigating vulnerabilities in LLM-as-a-Judge systems

🔎 Similar Papers

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation

2024-05-20arXiv.orgCitations: 18

Nvidia

30 USD - 94 USD

US, CA, Santa Clara

AI Research Scientist, Language - Monetization GenAI