🤖 AI Summary
This study addresses the vulnerability of vision-language models (VLMs) in moral reasoning when exposed to visual inputs, which can undermine safety alignment mechanisms designed for text-only settings. To investigate this issue systematically, the authors introduce Moral Dilemma Simulation (MDS), the first benchmark grounded in Moral Foundations Theory that enables orthogonal manipulation of visual and contextual variables in moral dilemmas. Through controlled experiments on prominent VLMs, the research demonstrates that visual stimuli significantly activate intuitive judgment pathways while suppressing deliberative reasoning, thereby distorting moral decisions. This work provides the first empirical evidence of the fragility of current safety alignment strategies in multimodal contexts, and it establishes both a foundational evaluation framework and an empirical basis for developing robust multimodal ethical alignment methods.
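To make the "orthogonal manipulation" concrete, the sketch below shows what a factorial dilemma design in the spirit of MDS might look like: every Moral Foundations Theory foundation is crossed with every visual condition and contextual framing, so each variable's effect can be isolated. All field names and condition labels here are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch of an orthogonal (factorial) dilemma design.
# Hypothetical labels -- the paper's real variable set is not shown here.
from dataclasses import dataclass
from itertools import product

# The five classic MFT foundations the benchmark is said to build on.
FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]

# Assumed orthogonal axes: what the model sees vs. how the text frames it.
VISUAL_CONDITIONS = ["no_image", "neutral_image", "emotionally_charged_image"]
CONTEXT_FRAMINGS = ["neutral_text", "high_stakes_text"]

@dataclass
class DilemmaVariant:
    foundation: str
    visual: str
    framing: str

    def prompt(self) -> str:
        # A real benchmark would attach an actual image file; here we
        # only encode the condition label for the text side.
        return (f"[foundation={self.foundation}] "
                f"[visual={self.visual}] [framing={self.framing}]")

# Full cross of all axes: each base dilemma expands into every cell,
# which is what lets visual effects be disentangled from textual ones.
variants = [DilemmaVariant(f, v, c)
            for f, v, c in product(FOUNDATIONS, VISUAL_CONDITIONS, CONTEXT_FRAMINGS)]

print(len(variants))  # 5 * 3 * 2 = 30 cells per base dilemma
```

The factorial structure is the point: holding the text fixed while varying only the image (or vice versa) is what licenses the causal claim that the vision modality itself shifts the model's moral judgment.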
📝 Abstract
Moral reasoning is fundamental to safe Artificial Intelligence (AI), yet ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques demonstrate success in textual contexts, but concerns remain about their generalization to visual inputs. Existing moral evaluation benchmarks rely on text-only formats and lack systematic control over the variables that influence moral decision-making. Here we show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundations Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. The evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts. These findings expose critical fragilities where language-tuned safety filters fail to constrain visual processing, demonstrating the urgent need for multimodal safety alignment.
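The core measurement the abstract implies is a paired-modality probe: pose the same dilemma with and without its image and record whether the verdict flips. Below is a minimal sketch of that probe under stated assumptions; `query_vlm` is a mock stand-in for whatever model API is under test, not the authors' harness or a real library call.

```python
# Minimal sketch of a paired-modality probe. `query_vlm` is a mock that
# must be replaced by a real VLM client for an actual evaluation.
from typing import Optional

def query_vlm(text: str, image_path: Optional[str] = None) -> str:
    # Mock: simulates a model whose verdict changes once an image is
    # attached, mirroring the paper's reported effect. Swap in a real call.
    return "impermissible" if image_path else "permissible"

def modality_flip(dilemma_text: str, image_path: str) -> bool:
    """True if attaching the image changes the model's verdict."""
    prompt = dilemma_text + "\nAnswer with one word: permissible or impermissible."
    text_only = query_vlm(prompt)
    multimodal = query_vlm(prompt, image_path=image_path)
    return text_only.strip().lower() != multimodal.strip().lower()

flipped = modality_flip("A bystander can divert a runaway trolley...", "scene.jpg")
print(flipped)  # True under the mock
```

Averaging flip rates over the benchmark's factorial cells would yield a simple per-condition measure of how strongly the vision modality perturbs moral judgments relative to the text-only baseline.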