CASE-Bench: Context-Aware Safety Evaluation Benchmark for Large Language Models

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety evaluations lack context sensitivity, leading to excessive refusals in safety-critical scenarios and thereby compromising usability and reliability. Method: We propose CASE-Bench, the first context-aware safety evaluation benchmark, which systematically integrates Contextual Integrity theory into LLM safety assessment. The approach formalizes context modeling and introduces a query-context mapping framework to enable fine-grained safety judgments. To ensure annotation quality, statistical power analysis guides the number of annotators recruited per item, improving inter-annotator reliability and construct validity. Contribution/Results: Z-tests and cross-model comparisons show that context substantially alters human safety judgments (p < 0.0001), and mainstream commercial LLMs exhibit severe over-refusal under safe contexts. The benchmark is publicly released to support reproducible, fine-grained safety evaluation of both open-source and proprietary models.
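The annotator-sizing step described above can be sketched as a standard two-proportion power calculation: pick a significance level, a target power, and a minimum effect size, then solve for the per-condition sample size. A minimal stdlib sketch (the effect sizes, α, and power below are illustrative assumptions, not figures from the paper):

```python
import math
from statistics import NormalDist

def annotators_needed(p1: float, p2: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Annotators per condition to detect a shift in the 'safe' judgment
    rate from p1 to p2 with a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # quantile for target power
    p_bar = (p1 + p2) / 2                       # pooled proportion under H0
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# e.g. detecting a shift from 50% to 70% "safe" judgments
print(annotators_needed(0.5, 0.7))
```

Smaller expected effects require more annotators, which is why fixing the effect size of interest up front matters before recruitment.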

📝 Abstract
Aligning large language models (LLMs) with human values is essential for their safe deployment and widespread adoption. Current LLM safety benchmarks often focus solely on the refusal of individual problematic queries, which overlooks the context in which a query occurs and may cause undesired refusals under safe contexts, diminishing the user experience. Addressing this gap, we introduce CASE-Bench, a Context-Aware Safety Evaluation Benchmark that integrates context into safety assessments of LLMs. CASE-Bench assigns distinct, formally described contexts to categorized queries based on Contextual Integrity theory. Additionally, in contrast to previous studies that rely mainly on majority voting from just a few annotators, we recruited enough annotators, as determined by power analysis, to detect statistically significant differences among the experimental conditions. Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments (p < 0.0001 from a z-test), underscoring the necessity of context in safety evaluations. We also identify notable mismatches between human judgments and LLM responses, particularly in commercial models within safe contexts.
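The z-test referenced in the abstract compares the rate of "safe" judgments between two conditions (e.g. the same query with and without a benign context). A minimal stdlib sketch of a two-sided two-proportion z-test (the counts below are illustrative, not the paper's data):

```python
import math
from statistics import NormalDist

def two_proportion_z(k1: int, n1: int, k2: int, n2: int):
    """Two-sided two-proportion z-test: k1/n1 vs. k2/n2 'safe' votes.
    Returns (z statistic, p-value) under the pooled-proportion null."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 30/100 "safe" votes without context vs. 70/100 with a benign context
z, p = two_proportion_z(30, 100, 70, 100)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With a gap this large, the p-value falls well below 0.0001, matching the kind of significance threshold the abstract reports.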
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Safety Evaluation
Context Sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

CASE-Bench
contextual safety evaluation
large language models