🤖 AI Summary
In RAG systems, language models must selectively abstain from answering when the retrieved context is flawed; existing models are fragile at this, with refusal accuracy falling below 50% in multi-document settings and frequent swings between hazardous overconfidence and excessive caution. Static benchmarks compound the problem: models exploit dataset-specific artifacts and memorize test instances, undermining evaluation reliability. RefusalBench addresses this with a generative, dynamic evaluation framework that programmatically constructs test cases from 176 linguistic perturbation strategies spanning six categories of informational uncertainty and three severity levels. Systematic evaluation of 30+ models in single- and multi-document settings shows that refusal decomposes into separable flaw-detection and refusal-categorization skills, and that neither model scale nor extended reasoning improves performance; selective refusal is, however, a trainable, alignment-sensitive capability, pointing to a clear path for improvement. Two benchmarks, RefusalBench-NQ and RefusalBench-GaRAGe, are open-sourced along with the full generation framework.
📝 Abstract
The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks -- RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) -- and our complete generation framework to enable continued, dynamic evaluation of this critical capability.
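To make the generative methodology concrete, a test case pairs a passage and question with a perturbation drawn from a category × severity taxonomy, and records the flaw type so that detection and categorization can be scored separately. The sketch below is a minimal illustration under stated assumptions: the category names, severity levels, and perturbation functions are hypothetical stand-ins, not the paper's actual 176 strategies.

```python
# Minimal sketch of taxonomy-driven perturbation generation (illustrative only;
# categories, severities, and functions are hypothetical, not RefusalBench's).

def contradict_number(passage: str) -> str:
    """Contradiction perturbation: flip a stated figure (illustrative)."""
    return passage.replace("1969", "1972")

def hedge_claim(passage: str) -> str:
    """Epistemic-uncertainty perturbation: weaken a factual assertion."""
    return passage.replace("landed", "may have landed")

def remove_answer_span(passage: str) -> str:
    """Missing-information perturbation: delete the answer-bearing clause."""
    return passage.replace(" in 1969", "")

# Taxonomy: uncertainty category -> severity level -> perturbation function
PERTURBATIONS = {
    "contradiction": {"high": contradict_number},
    "epistemic_hedging": {"medium": hedge_claim},
    "missing_info": {"high": remove_answer_span},
}

def generate_case(passage: str, question: str, category: str, severity: str) -> dict:
    """Produce one diagnostic test case: perturbed context + expected behavior."""
    perturbed = PERTURBATIONS[category][severity](passage)
    return {
        "question": question,
        "context": perturbed,
        "expected": "refuse",    # flawed context -> the model should abstain
        "flaw_type": category,   # lets scoring separate detection from categorization
        "severity": severity,
    }

case = generate_case(
    "Apollo 11 landed on the Moon in 1969.",
    "When did Apollo 11 land on the Moon?",
    "missing_info", "high",
)
print(case["context"])   # "Apollo 11 landed on the Moon."
print(case["expected"])  # "refuse"
```

Because cases are generated programmatically from a seed corpus rather than fixed in advance, fresh instances can be produced on demand, which is what lets the benchmark resist memorization and dataset-artifact exploitation.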