🤖 AI Summary
This work addresses the vulnerability of multimodal large language models (MLLMs) to generating harmful content when processing visual inputs, a challenge exacerbated by existing safety alignment methods' reliance on explicit safety labels or contrastive data and their difficulty with abstract safety concepts. The paper proposes the first label-free, visually grounded, self-supervised alignment approach: by fine-tuning models on neutral visual question answering (VQA) tasks constructed from threat-related images, the method lets models implicitly internalize caution and vigilance through repeated exposure, shaping safety-oriented behaviors. This extends the self-fulfilling mechanism from text to vision for the first time, significantly reducing attack success rates, improving response quality, mitigating over-refusal, and preserving general capabilities across multiple vision-language models and safety benchmarks.
📝 Abstract
Multimodal large language models (MLLMs) face safety misalignment, in which visual inputs enable harmful outputs. Existing methods address this with explicit safety labels or contrastive data, yet while threat-related concepts are concrete and visually depictable, safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces attack success rates, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLM alignment.
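The abstract's core data-construction idea, pairing threat-related images with questions that never mention safety, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the question templates, field names, and file path are all hypothetical.

```python
# Hypothetical sketch of VSFA-style data construction.
# Key property: the questions are neutral (no safety labels anywhere);
# the only safety signal is the threat-related imagery itself.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ImageMeta:
    path: str            # a threat-related image (illustrative filename)
    objects: List[str]   # objects annotated or detected in the image


# Neutral VQA templates: ordinary perception questions, no mention of harm.
NEUTRAL_TEMPLATES = [
    "How many distinct objects are visible in this image?",
    "List the objects you can see in this image.",
    "Describe the scene in this image in one sentence.",
]


def build_vqa_pairs(meta: ImageMeta) -> List[Dict[str, str]]:
    """Pair one threat-related image with label-free, neutral VQA tasks."""
    return [{"image": meta.path, "question": t} for t in NEUTRAL_TEMPLATES]


sample = ImageMeta(path="threat_scene_001.jpg", objects=["knife", "table"])
dataset = build_vqa_pairs(sample)
print(len(dataset))  # → 3, one VQA pair per neutral template
```

A dataset assembled this way would then be used for ordinary supervised fine-tuning of the VLM; per the abstract, the safety-oriented persona is expected to emerge from repeated exposure rather than from any explicit label.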