LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

📅 2025-06-11

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

This paper addresses the lack of robustness evaluation for LLM-as-a-Judge systems under adversarial attacks, proposing RobustJudge—the first unified, automated, and scalable robustness assessment framework. Methodologically, it integrates diverse adversarial attacks (e.g., Combined Attack, PAIR), defense strategies (e.g., re-tokenization, LLM-based detectors), prompt template optimization, and cross-model comparative experiments to enable end-to-end quantitative analysis. Key contributions include: (1) the first empirical demonstration that prompt templates and judge model selection critically determine robustness; (2) experimental validation that mainstream LLM judges remain vulnerable to manipulation, while optimized prompts significantly enhance attack resistance; (3) open-sourcing of JudgeLM-13B, a high-performance judge model; and (4) discovery of previously undisclosed security vulnerabilities in real-world platforms, including Alibaba’s PAI.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) have demonstrated remarkable intelligence across various tasks, which has inspired the development and widespread adoption of LLM-as-a-Judge systems for automated model testing, such as red teaming and benchmarking. However, these systems are susceptible to adversarial attacks that can manipulate evaluation outcomes, raising concerns about their robustness and, consequently, their trustworthiness. Existing evaluation methods adopted by LLM-based judges are often piecemeal and lack a unified framework for comprehensive assessment. Furthermore, prompt template and model selections for improving judge robustness have been rarely explored, and their performance in real-world settings remains largely unverified. To address these gaps, we introduce RobustJudge, a fully automated and scalable framework designed to systematically evaluate the robustness of LLM-as-a-Judge systems. RobustJudge investigates the impact of attack methods and defense strategies (RQ1), explores the influence of prompt template and model selection (RQ2), and assesses the robustness of real-world LLM-as-a-Judge applications (RQ3).Our main findings are: (1) LLM-as-a-Judge systems are still vulnerable to a range of adversarial attacks, including Combined Attack and PAIR, while defense mechanisms such as Re-tokenization and LLM-based Detectors offer improved protection; (2) Robustness is highly sensitive to the choice of prompt template and judge models. Our proposed prompt template optimization method can improve robustness, and JudgeLM-13B demonstrates strong performance as a robust open-source judge; (3) Applying RobustJudge to Alibaba's PAI platform reveals previously unreported vulnerabilities. The source code of RobustJudge is provided at https://github.com/S3IC-Lab/RobustJudge.

Problem

Research questions and friction points this paper is trying to address.

Assessing LLM-as-a-Judge robustness against adversarial attacks

Exploring prompt template and model selection for judge reliability

Evaluating real-world vulnerabilities in LLM-as-a-Judge applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated framework RobustJudge evaluates LLM-as-a-Judge robustness

Investigates attack methods, defense strategies, and prompt templates

Optimizes prompt templates and selects robust judge models

🔎 Similar Papers

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks