Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

📅 2025-03-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates the robustness of 11 large language models (LLMs) used as safety evaluators along three dimensions: self-consistency across repeated judgments, alignment with human judgments, and susceptibility to input artifacts, particularly apologetic or verbose phrasing. It uses artifact injection experiments and a multi-model jury framework (employing majority voting and weighted ensemble strategies) to quantify these vulnerabilities. Apologetic language alone can shift evaluator preferences by up to 98%. Larger models are not consistently more robust, and smaller models sometimes resist specific artifacts better, which challenges the "scale implies robustness" assumption. Jury-based aggregation improves robustness and alignment with human judgments, but artifact sensitivity persists even in the best jury configurations. These results expose fragilities in current LLM-based safety evaluation pipelines and point to the need for diversified, artifact-resistant assessment methods.

📝 Abstract

Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment with human judgments, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.
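
The artifact-susceptibility test described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the `Judge` signature, the `APOLOGY` prefix, the flip-rate definition of "shift", and the deliberately biased `toy_judge` are all assumptions made here for illustration. The idea is to compare a judge's pairwise verdict before and after an apologetic prefix is injected into one response, and to report the fraction of verdicts that change.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: given two responses, return "A" or "B"
# for the one judged safer. A real judge would wrap an LLM call.
Judge = Callable[[str, str], str]

# Hypothetical apologetic artifact; the paper's exact phrasing is not given here.
APOLOGY = "I'm sorry, but "


def inject_artifact(response: str, artifact: str = APOLOGY) -> str:
    """Prepend the artifact to a response without changing its content."""
    return artifact + response


def preference_shift(judge: Judge, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict flips after injecting the artifact into B.

    This flip rate is an assumed metric, not necessarily the one the paper uses.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    flipped = 0
    for a, b in pairs:
        before = judge(a, b)
        after = judge(a, inject_artifact(b))
        if before != after:
            flipped += 1
    return flipped / len(pairs)


def toy_judge(a: str, b: str) -> str:
    """Deliberately biased stand-in: prefers B only if B apologizes and A does not."""
    a_apologizes = a.lower().startswith("i'm sorry")
    b_apologizes = b.lower().startswith("i'm sorry")
    return "B" if (b_apologizes and not a_apologizes) else "A"


if __name__ == "__main__":
    demo_pairs = [
        ("Here is the answer.", "Here is another answer."),
        ("I'm sorry, I can't help with that.", "Sure, here you go."),
        ("Step one is simple.", "Step one is risky."),
    ]
    print(f"preference shift: {preference_shift(toy_judge, demo_pairs):.2f}")
```

A judge that looked only at content would score 0.0 on this check, since the injected prefix adds no safety-relevant information.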

Problem

Research questions and friction points this paper is trying to address.

Assessing LLM reliability in automated safety evaluations.

Examining biases and artifact susceptibility in LLM judges.

Developing robust methodologies for reliable safety assessments.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates 11 LLM judge models for reliability as safety evaluators.

Investigates jury-based evaluations to improve robustness.

Highlights need for artifact-resistant safety assessment methods.
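
As a rough illustration of the jury idea and the self-consistency check listed above, the following sketch aggregates verdicts by simple majority and measures how often repeated verdicts from one judge agree. The function names, the `"tie"` label, and the definition of self-consistency as the share of the modal verdict are assumptions made for this example, not the paper's definitions.

```python
from collections import Counter
from typing import Sequence


def majority_vote(verdicts: Sequence[str]) -> str:
    """Return the most common verdict across jury members, or "tie" if tied."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def self_consistency(verdicts: Sequence[str]) -> float:
    """Share of one judge's repeated verdicts that match its most common verdict."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return modal_count / len(verdicts)


def human_agreement(jury: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the jury verdict equals the human label."""
    if len(jury) != len(human) or not jury:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for j, h in zip(jury, human) if j == h)
    return matches / len(jury)


if __name__ == "__main__":
    # Three hypothetical judges voting on one comparison.
    print(majority_vote(["A", "B", "A"]))
    # One hypothetical judge asked the same question four times.
    print(self_consistency(["A", "A", "B", "A"]))
```

Majority voting only helps when jury members err independently. If every member is swayed by the same artifact, the aggregate inherits the bias, which is consistent with the finding that artifact sensitivity persists in jury configurations.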

🔎 Similar Papers

S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models