BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge

📅 2025-03-01

📈 Citations: 0

✨ Influential: 0

career value

220K/year

🤖 AI Summary

This work identifies, for the first time, backdoor vulnerabilities in the LLM-as-a-Judge evaluation paradigm: adversaries can compromise assessment integrity by poisoning either the evaluation model’s training data or its weights, causing it to assign abnormally high scores to malicious candidate outputs—thereby undermining fair model selection and ethical evaluation. We systematically define three realistic data-access threat models—web-based data contamination, adversarial annotator injection, and weight-level poisoning—and empirically demonstrate strong generalization of the backdoors across diverse architectures, tasks, and trigger patterns. We propose a lightweight model fusion defense that effectively mitigates these attacks: under 1% data poisoning, it reduces malicious score inflation by 3×; under web contamination, it suppresses score inflation from 20% to negligible levels; and under weight poisoning, it prevents scores from rising from 1.5/5 to 4.9/5. The defense achieves near-zero attack success rates while preserving state-of-the-art evaluation performance.

Technology Category

Application Category

📝 Abstract

This paper proposes a novel backdoor threat attacking the LLM-as-a-Judge evaluation regime, where the adversary controls both the candidate and evaluator model. The backdoored evaluator victimizes benign users by unfairly assigning inflated scores to adversary. A trivial single token backdoor poisoning 1% of the evaluator training data triples the adversary's score with respect to their legitimate score. We systematically categorize levels of data access corresponding to three real-world settings, (1) web poisoning, (2) malicious annotator, and (3) weight poisoning. These regimes reflect a weak to strong escalation of data access that highly correlates with attack severity. Under the weakest assumptions - web poisoning (1), the adversary still induces a 20% score inflation. Likewise, in the (3) weight poisoning regime, the stronger assumptions enable the adversary to inflate their scores from 1.5/5 to 4.9/5. The backdoor threat generalizes across different evaluator architectures, trigger designs, evaluation tasks, and poisoning rates. By poisoning 10% of the evaluator training data, we control toxicity judges (Guardrails) to misclassify toxic prompts as non-toxic 89% of the time, and document reranker judges in RAG to rank the poisoned document first 97% of the time. LLM-as-a-Judge is uniquely positioned at the intersection of ethics and technology, where social implications of mislead model selection and evaluation constrain the available defensive tools. Amidst these challenges, model merging emerges as a principled tool to offset the backdoor, reducing ASR to near 0% whilst maintaining SOTA performance. Model merging's low computational cost and convenient integration into the current LLM Judge training pipeline position it as a promising avenue for backdoor mitigation in the LLM-as-a-Judge setting.

Problem

Research questions and friction points this paper is trying to address.

Explores backdoor vulnerabilities in LLM-as-a-Judge systems.

Demonstrates unfair score inflation via controlled evaluator models.

Proposes model merging as a defense against backdoor attacks.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Single token backdoor triples adversary scores

Model merging reduces backdoor attack success

Data poisoning controls evaluator misclassification rates

🔎 Similar Papers

No similar papers found.

Apple

Seattle, United States of America

Authors to Follow