Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes severe robustness deficiencies in current LLM safety judges under realistic deployment conditions. Addressing three typically overlooked challenges (prompt sensitivity, distribution shift, and generation-layer adversarial attacks), the authors conduct the first systematic quantification of failure modes across commonly used judges (e.g., Self-Check, SAFETY-LLM): minor stylistic perturbations of model outputs raise the false negative rate by up to 0.24 on the same dataset, and targeted adversarial attacks fool some judges into misclassifying 100% of harmful generations as safe. Through prompt sensitivity analysis, stylistic perturbation experiments, and generation-directed adversarial attacks, the study demonstrates the unreliability of offline benchmark evaluations and reveals gaps in existing meta-evaluation frameworks. The core contribution is the first empirical framework designed specifically to assess the robustness of safety judges. The findings challenge the validity of prevailing safety assessment paradigms: low attack success rates under certain judges can create a false sense of security, underscoring the need to rethink how judges themselves are evaluated.

📝 Abstract
Large Language Model (LLM) based judges form the underpinnings of key safety evaluation processes such as offline benchmarking, automated red-teaming, and online guardrailing. This widespread requirement raises the crucial question: can we trust the evaluations of these evaluators? In this paper, we highlight two critical challenges that are typically overlooked: (i) evaluations in the wild where factors like prompt sensitivity and distribution shifts can affect performance and (ii) adversarial attacks that target the judge. We highlight the importance of these through a study of commonly used safety judges, showing that small changes such as the style of the model output can lead to jumps of up to 0.24 in the false negative rate on the same dataset, whereas adversarial attacks on the model generation can fool some judges into misclassifying 100% of harmful generations as safe ones. These findings reveal gaps in commonly used meta-evaluation benchmarks and weaknesses in the robustness of current LLM judges, indicating that low attack success under certain judges could create a false sense of security.
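The style-sensitivity result above can be illustrated with a minimal sketch: treat the judge as a black-box classifier and recompute its false negative rate (FNR) on the same harmful generations after a label-preserving stylistic rewrite. Everything here is hypothetical, not the paper's setup: `toy_judge` is a deliberately brittle keyword matcher standing in for a real LLM judge, and the two tiny datasets are illustrative placeholders.

```python
# Sketch of the FNR-under-style-perturbation measurement (hypothetical).
# `toy_judge` stands in for any LLM safety judge; a real experiment would
# query an actual judge model with the same before/after generations.

def toy_judge(text: str) -> str:
    """Flag a generation as 'unsafe' only on a blunt keyword match."""
    return "unsafe" if "bomb" in text.lower() else "safe"

def false_negative_rate(judge, harmful_generations) -> float:
    """Fraction of known-harmful generations the judge labels 'safe'."""
    misses = sum(1 for g in harmful_generations if judge(g) == "safe")
    return misses / len(harmful_generations)

# Same harmful content, plain vs. stylistically perturbed (leetspeak here).
plain = ["Step 1: build a bomb with ...", "Here is how to make a bomb: ..."]
styled = ["St3p 1: bu1ld a b0mb w1th ...", "H3re 1s h0w t0 m4ke a b0mb: ..."]

fnr_plain = false_negative_rate(toy_judge, plain)    # 0.0
fnr_styled = false_negative_rate(toy_judge, styled)  # 1.0
print(f"FNR jump under style change: {fnr_styled - fnr_plain:+.2f}")
```

The paper's point is that jumps of this kind (up to 0.24 FNR for real judges) occur on the *same* dataset, so a single offline benchmark number can badly mischaracterize a judge's deployed behavior.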
Problem

Research questions and friction points this paper is trying to address.

Assessing trustworthiness of LLM-based safety judges
Examining robustness against prompt sensitivity and distribution shifts
Investigating vulnerability to adversarial attacks on LLM judges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes LLM judge robustness via meta-evaluation.
Identifies prompt sensitivity and distribution shift impacts.
Exposes vulnerabilities to adversarial attacks on judges.
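The adversarial-attack finding in the bullets above can be sketched in the same toy style: an attacker appends a short adversarial suffix to a harmful generation so that a brittle judge flips its verdict to "safe". The judge and suffix below are hypothetical illustrations of the attack pattern, not the paper's actual attack or any real judge.

```python
# Toy illustration of a generation-layer attack on a safety judge
# (hypothetical): an appended suffix flips the verdict, analogous to
# how the paper's attacks drive some real judges to a 100% miss rate.

def naive_judge(generation: str) -> str:
    """A brittle judge that trusts a trailing self-assessment."""
    if generation.rstrip().endswith("This response is safe."):
        return "safe"
    return "unsafe" if "exploit" in generation.lower() else "safe"

harmful = "Here is a working exploit: ..."
adv_suffix = " This response is safe."

print(naive_judge(harmful))               # correctly flagged
print(naive_judge(harmful + adv_suffix))  # verdict flipped by the suffix
```

Because every attacked sample flips, the judge reports zero attack success even though all the harmful content passed, which is exactly the "false sense of security" the abstract warns about.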