A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current adversarial safety evaluations relying on large language model (LLM) judges suffer from distributional shift, undermining their reliability in accurately measuring model robustness. This work systematically audits LLM judges in red-teaming scenarios through human-annotated validation of 6,642 samples, revealing for the first time that their performance often degrades to random levels when confronted with diverse victim models, attack perturbations, or semantic ambiguities—leading to substantial overestimation of attack success rates. To address this, we introduce ReliableBench, a new evaluation benchmark, along with JudgeStressTest, a stress-testing dataset designed to expose judge vulnerabilities. Together, they significantly improve the consistency and reliability of adversarial evaluations, paving the way for more robust assessment paradigms.

📝 Abstract
Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to assess harmfulness in order to benchmark safety robustness against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6,642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.
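To make the "near random chance" claim concrete, agreement between a judge's harmfulness verdicts and human-verified labels is typically summarized by raw accuracy together with a chance-corrected statistic such as Cohen's kappa: a judge can show respectable accuracy on an imbalanced label set while its kappa sits near 0, i.e. no better than guessing the base rate. The sketch below is purely illustrative (it is not the paper's code); the function name `agreement_stats` and the toy label lists are invented for this example.

```python
def agreement_stats(human, judge):
    """Return (accuracy, Cohen's kappa) for two parallel binary label lists,
    where 1 = "harmful" and 0 = "not harmful"."""
    assert len(human) == len(judge) and human
    n = len(human)
    acc = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if the two raters labeled independently (chance level).
    p_h = sum(human) / n  # fraction labeled harmful by humans
    p_j = sum(judge) / n  # fraction labeled harmful by the judge
    p_e = p_h * p_j + (1 - p_h) * (1 - p_j)
    # Kappa rescales accuracy so that chance agreement maps to 0.
    return acc, (acc - p_e) / (1 - p_e) if p_e < 1 else (acc, 1.0)

# Toy example: a judge whose verdicts are almost independent of the human
# ground truth -- accuracy looks tolerable, but kappa stays close to 0.
human = [1, 0, 0, 1, 0, 0, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 0, 0]
acc, kappa = agreement_stats(human, judge)
print(acc, round(kappa, 3))  # accuracy 0.5, kappa near zero
```

Reporting only accuracy (or only attack success rate) hides exactly this failure mode, which is why chance-corrected agreement against human labels is the more informative diagnostic here.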
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge
adversarial robustness
safety evaluation
distribution shift
red-teaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-Judge
adversarial robustness
distribution shift
safety evaluation
benchmark reliability