OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

πŸ“… 2026-03-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a critical gap in the safety alignment of multimodal large language models (MLLMs): current approaches predominantly target explicit malicious intent while overlooking harmful consequences embedded in contextual causal chains. To close this gap, the authors propose a consequence-driven safety paradigm that shifts the focus from overt violations to implicit risks, exposing blind spots in state-of-the-art models' causal reasoning. They introduce OOD-MMSafe, a benchmark of 455 image-text pairs designed to evaluate implicit risk recognition, and present the Consequence-Aware Safety Policy Optimization (CASPO) framework, which enhances consequence reasoning through token-level dynamic self-distillation. Experiments show that the approach reduces risk-identification failure rates to 7.3% on Qwen2.5-VL-7B and 5.7% on Qwen3-VL-4B, substantially improving consequence prediction without compromising overall model performance.

πŸ“ Abstract
While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model's ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with failure rates reaching 67.5% even in high-capacity closed-source models, and identifies a preference ceiling: as model capacity grows, static alignment yields format-centric failures rather than improved safety reasoning. To address these bottlenecks, we develop the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model's intrinsic reasoning as a dynamic reference for token-level self-distillation rewards. Experimental results demonstrate that CASPO significantly enhances consequence projection, reducing the risk-identification failure rate to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B while maintaining overall effectiveness.
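The abstract describes CASPO's token-level self-distillation reward only at a high level. A minimal sketch of one plausible formulation is given below; the function name, the log-likelihood-ratio reward, and the clipping scheme are all assumptions for illustration, not the paper's actual method:

```python
import numpy as np

def token_self_distill_rewards(policy_logprobs, ref_logprobs, clip=5.0):
    """Hypothetical per-token self-distillation reward.

    The idea sketched here: score each generated token by the log-likelihood
    ratio between the current policy and a frozen snapshot of the same model
    (serving as the "dynamic reference"), then clip for training stability.
    """
    ratio = np.asarray(policy_logprobs, dtype=float) - np.asarray(ref_logprobs, dtype=float)
    return np.clip(ratio, -clip, clip)
```

Under this reading, tokens the policy now prefers more strongly than its own reference snapshot receive positive reward, yielding a dense token-level signal rather than a single sequence-level one.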
Problem

Research questions and friction points this paper is trying to address.

consequence-driven safety
multimodal large language models
latent hazards
causal blindness
safety alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

consequence-driven safety
OOD-MMSafe
causal blindness
CASPO
multimodal safety alignment
πŸ”Ž Similar Papers
No similar papers found.