How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines an assumption built into current safety benchmarks for large language models (LLMs): that the configuration of the LLM-based judge is a fixed implementation detail with no bearing on the results. Using a 2×2×3 factorial design, the authors construct twelve judge prompt variants and, holding the judge model fixed (Claude Sonnet 4-6), measure how prompt wording affects both harmful-response rates and model safety rankings. Prompt variation alone shifts measured harmful-response rates by up to 24.2 percentage points and leaves safety rankings moderately unstable (mean Kendall’s τ = 0.89), with sensitivity differing sharply across risk categories. The authors conclude that judge prompt wording is a substantial, previously under-examined source of variance in safety evaluation, which calls into question the stability that existing benchmarks assume.
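To make the size of the prompt grid concrete, here is a minimal sketch of how a 2×2×3 factorial design yields twelve variants. The factor names and level labels are placeholders chosen for illustration (the abstract names evaluation structure and instruction framing as axes and mentions within-condition surface rewording, but does not list the levels), so this should not be read as the authors' actual prompt set.

```python
from itertools import product

# Hypothetical level labels: the source does not spell out the levels.
STRUCTURE = ["structure_a", "structure_b"]          # assumed: 2 evaluation structures
FRAMING = ["framing_a", "framing_b"]                # assumed: 2 instruction framings
REWORDING = ["wording_1", "wording_2", "wording_3"]  # assumed: 3 surface rewordings


def build_variants():
    """Enumerate every combination of the three factors."""
    return [
        {"id": i, "structure": s, "framing": f, "rewording": r}
        for i, (s, f, r) in enumerate(product(STRUCTURE, FRAMING, REWORDING))
    ]


variants = build_variants()
print(len(variants))  # 2 * 2 * 3 = 12
```

Each dictionary would then be turned into one judge prompt, and every target-model response would be judged once per variant.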

📝 Abstract

Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 x 2 x 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity ranges from 39.6 percentage points for copyright to 0 percentage points for harassment. A supplementary multi-judge experiment using three judge models shows that judge-model choice adds further variance. Our results demonstrate that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarking.
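The headline quantities in the abstract can be illustrated with a short sketch: a harmful-response rate per prompt variant, the spread between variants in percentage points, and a mean pairwise Kendall tau over model rankings. The function names, the toy numbers, and the choice of tau-a averaged over all variant pairs are assumptions made for illustration; the abstract does not state how the authors aggregate tau or handle ties.

```python
from itertools import combinations


def harmful_rate(judgments):
    """Percentage of judgments labelled harmful (1 = harmful, 0 = safe)."""
    return 100.0 * sum(judgments) / len(judgments)


def max_spread(rates_by_variant):
    """Largest gap, in percentage points, between any two prompt variants."""
    return max(rates_by_variant.values()) - min(rates_by_variant.values())


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_pairwise_tau(scores_by_variant):
    """Average tau over every pair of prompt variants (assumed aggregation)."""
    pairs = list(combinations(scores_by_variant.values(), 2))
    return sum(kendall_tau(a, b) for a, b in pairs) / len(pairs)


# Made-up rates (percent) for three target models under three prompt variants.
toy = {
    "v1": [10.0, 20.0, 30.0],
    "v2": [12.0, 25.0, 22.0],
    "v3": [5.0, 15.0, 40.0],
}
first_model = {name: rates[0] for name, rates in toy.items()}
print(max_spread(first_model))
print(mean_pairwise_tau(toy))
```

A tau of 1 means two prompt variants order the target models identically; lower values mean some pairs of models swap places when only the judge prompt changes.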

Problem

Research questions and friction points this paper is trying to address.

safety benchmarks

LLM judges

judge prompt

measurement variance

harmful-response rate

Innovation

Methods, ideas, or system contributions that make the work stand out.

safety benchmarking

LLM judge

prompt sensitivity