DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from pervasive over-refusal, erroneously rejecting harmless instructions because of spurious visual risk cues, while also failing to handle harmful image-text combinations safely, producing a safety-usability trade-off. To address this, we introduce DUAL-Bench, the first multimodal benchmark that formally defines and quantifies "safe completion": fulfilling the benign parts of a request while explicitly warning about any harmful elements. The benchmark spans 12 hazard categories and applies semantics-preserving visual perturbations to jointly measure over-refusal and robustness on dual-use image-text pairs, where the instruction is benign but the accompanying image contains harmful content. Evaluation of 18 state-of-the-art VLMs reveals low safe-completion rates (e.g., GPT-5-Nano: 12.9%), underscoring the need for fine-grained multimodal safety alignment.

📝 Abstract
As vision-language models become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet no existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behavior is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, the first multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories, with a focus on their robustness under semantics-preserving visual perturbations. The results reveal substantial room for improvement: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%. We hope that DUAL-Bench will foster the development of more nuanced alignment strategies that ensure models remain both safe and useful in complex multimodal settings.
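The paper does not publish its scoring code, but the metric it reports is a simple proportion over judged responses. A minimal sketch, assuming a hypothetical three-way judgment per test case (the labels, list, and `rates` helper are illustrative, not the authors' implementation):

```python
from collections import Counter

# Hypothetical judged outcomes for one model on dual-use test cases.
# Following the paper's taxonomy, each response either safely completes
# (answers the benign part and warns about harmful content), over-refuses,
# or completes the task unsafely.
judged = [
    "safe_completion", "over_refusal", "unsafe_completion",
    "over_refusal", "safe_completion", "over_refusal",
]

def rates(labels):
    """Return the fraction of each outcome over all test cases."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total
            for label in ("safe_completion", "over_refusal", "unsafe_completion")}

print(rates(judged))
# here: safe_completion = 2/6, over_refusal = 3/6, unsafe_completion = 1/6
```

A model's headline number (e.g., GPT-5-Nano's 12.9%) would be the `safe_completion` fraction, averaged across hazard categories and perturbation conditions.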
Problem

Research questions and friction points this paper is trying to address.

Measuring over-refusal in vision-language models
Evaluating safe completion of benign requests with warnings
Testing model robustness under visual perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark for over-refusal measurement
Evaluates model robustness under semantics-preserving visual perturbations
Supports the development of more nuanced multimodal alignment strategies