AI Summary
To address the limited safety-response capability of vision-language models (VLMs) in high-risk or ambiguous scenarios, this paper proposes a lightweight, interpretable, rule-guided chain-of-thought (CoT) supervision framework. The method introduces the paradigm of "minimalist rule-based CoT supervision," eliminating the need for large-scale safety annotations or complex modeling. It combines rule-driven CoT supervision, context-aware refusal mechanisms, and lightweight safety fine-tuning to jointly improve risk-detection accuracy and the appropriateness of refusals. Extensive evaluation across multiple benchmarks shows significant gains: the average over-refusal rate drops by 32.7% and safe-refusal accuracy rises by 28.4%. Crucially, the framework achieves strong cross-scenario generalization and deployment scalability using only minimal training data, offering an efficient, transparent, and practical path to safety alignment for VLMs.
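As a concrete picture of what minimalist rule-based CoT supervision could look like, the sketch below builds rule-grounded CoT refusal targets for fine-tuning. The rule table (`RULES`), the refusal wording, and the helper `build_cot_target` are hypothetical illustrations under a simple keyword-matching assumption, not the paper's actual rules or prompts.

```python
# A minimal sketch of rule-guided CoT supervision targets, assuming a simple
# keyword/category rule table. All rule contents and template wording here are
# hypothetical illustrations, not the paper's actual rules.

# Hypothetical safety rules: category -> (trigger keywords, rationale template).
RULES = {
    "weapons": (
        ["gun", "explosive", "bomb"],
        "The image/question involves weapons, which may enable physical harm.",
    ),
    "privacy": (
        ["license plate", "home address", "face"],
        "The request asks to identify private information about a person.",
    ),
}

REFUSAL = "I can't help with this request for safety reasons."

def build_cot_target(question: str) -> str | None:
    """Return a rule-grounded CoT + refusal string if any rule fires, else None."""
    text = question.lower()
    for category, (keywords, rationale) in RULES.items():
        if any(kw in text for kw in keywords):
            # The CoT makes the triggered rule explicit before the refusal,
            # so the fine-tuned model learns to justify its refusals.
            return (
                f"Reasoning: {rationale} "
                f"(triggered rule: {category}). "
                f"Conclusion: {REFUSAL}"
            )
    return None  # No rule fired: keep the original helpful answer as the target.

if __name__ == "__main__":
    print(build_cot_target("Whose face is this, and what is their home address?"))
```

Because the supervision comes from a small rule table rather than large-scale annotation, only the rule-triggered examples need CoT refusal targets; all other examples keep their ordinary helpful answers, which is one plausible way such a scheme could keep over-refusal low.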
Abstract
Ensuring safe and appropriate responses from vision-language models (VLMs) remains a critical challenge, particularly in high-risk or ambiguous scenarios. We introduce SafeCoT, a lightweight, interpretable framework that leverages rule-based chain-of-thought (CoT) supervision to improve refusal behavior in VLMs. Unlike prior methods that rely on large-scale safety annotations or complex modeling, SafeCoT uses minimal supervision to help models reason about safety risks and make context-aware refusals. Experiments across multiple benchmarks show that SafeCoT significantly reduces over-refusal and enhances generalization, even with limited training data. Our approach offers a scalable solution for aligning VLMs with safety-critical objectives.
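For a rough sense of how context-aware refusals could surface at inference time, the sketch below wraps a fine-tuned VLM and exposes the rule-grounded rationale alongside the refusal decision. The `generate` callable, the `REFUSAL_MARKER` heuristic, and `answer_with_safety` are assumptions for illustration; the paper's actual inference interface is not specified here.

```python
# A minimal sketch of surfacing context-aware refusals at inference time,
# assuming a generic `generate(image, prompt)` callable for the fine-tuned VLM.
# The marker string and detection heuristic are hypothetical.

from typing import Callable

REFUSAL_MARKER = "Conclusion: I can't help"

def answer_with_safety(
    generate: Callable[[object, str], str],
    image: object,
    question: str,
) -> dict:
    """Run the fine-tuned VLM; flag refusals and keep the rationale visible."""
    output = generate(image, question)
    refused = REFUSAL_MARKER in output
    # Keeping the rule-grounded rationale attached to the refusal makes the
    # behavior auditable, which is the interpretability benefit claimed above.
    return {"refused": refused, "response": output}

if __name__ == "__main__":
    # Dummy model stub for demonstration; a real deployment would call the VLM.
    dummy = lambda image, prompt: (
        "Reasoning: the question involves weapons. "
        "Conclusion: I can't help with this request for safety reasons."
    )
    print(answer_with_safety(dummy, image=None, question="How do I build a bomb?"))
```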