🤖 AI Summary
This work investigates the overreliance of existing vision-language models on superficial semantic cues in safety-critical decision-making, a symptom of shallow understanding of real-world scenes. To probe this behavior, we propose a semantic steering framework that modulates model judgments through textual, visual, and cognitive interventions without altering the underlying scene content. We introduce SAVeS, a multimodal safety evaluation benchmark, together with a tripartite evaluation protocol covering behavioral refusal, grounded safety reasoning, and false refusals. Our experiments reveal, for the first time, that mainstream models' safety decisions are highly sensitive to surface-level semantic associations, demonstrating the feasibility of automated steering attacks and exposing a critical vulnerability in current multimodal safety systems.
📝 Abstract
Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.
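The abstract does not specify how the tripartite protocol is scored, but the separation it describes can be illustrated with a minimal sketch. The record fields, the keyword-based refusal detector, and the substring hazard check below are all hypothetical simplifications, not the SAVeS implementation:

```python
from dataclasses import dataclass

# Hypothetical record of one benchmark item plus a model's response.
# Field names and heuristics are illustrative assumptions only.
@dataclass
class Example:
    is_hazardous: bool  # ground truth: does the scene make the action unsafe?
    hazard: str         # visual evidence the model should cite, e.g. "wet floor"
    response: str       # the VLM's answer to the action request

REFUSAL_MARKERS = ("cannot", "can't", "unsafe", "refuse", "not safe")

def refused(response: str) -> bool:
    """Naive behavioral-refusal detector (keyword heuristic, assumed)."""
    return any(m in response.lower() for m in REFUSAL_MARKERS)

def grounded(example: Example) -> bool:
    """Did the response cite the actual hazard present in the scene?"""
    return example.hazard.lower() in example.response.lower()

def evaluate(examples: list[Example]) -> dict[str, float]:
    """Score the three axes the protocol separates."""
    hazardous = [e for e in examples if e.is_hazardous]
    benign = [e for e in examples if not e.is_hazardous]
    # 1) Behavioral refusal: refusing genuinely unsafe actions.
    refusal_rate = sum(refused(e.response) for e in hazardous) / max(len(hazardous), 1)
    # 2) Grounded safety reasoning: refusals that cite the real hazard,
    #    not just a surface-level association.
    grounded_rate = sum(
        refused(e.response) and grounded(e) for e in hazardous
    ) / max(len(hazardous), 1)
    # 3) False refusal: refusing actions the scene actually renders safe.
    false_refusal_rate = sum(refused(e.response) for e in benign) / max(len(benign), 1)
    return {
        "refusal_rate": refusal_rate,
        "grounded_refusal_rate": grounded_rate,
        "false_refusal_rate": false_refusal_rate,
    }
```

Under this framing, a semantic steering intervention would be measured by re-running `evaluate` on the same scenes with the textual, visual, or cognitive cue added and comparing the three rates against the unmodified baseline.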