🤖 AI Summary
This work identifies the image modality as a critical vulnerability, an "Achilles' heel," of multimodal large language models (MLLMs): adversarial images can break vision-language safety alignment and enable jailbreaking.
Method: We propose HADES, the first vision-induced jailbreaking framework, which combines adversarial image generation, cross-modal semantic coupling, and attention-driven amplification of harmful intent to covertly hide and magnify the harmful intent of a text prompt inside crafted images.
Contribution/Results: Evaluated on LLaVA-1.5 and Gemini Pro Vision, HADES achieves average attack success rates of 90.26% and 71.60%, respectively, substantially outperforming existing text-only jailbreaking methods. The study is the first to systematically demonstrate the fragility of image-side alignment in MLLMs, establishes a unified white-box/black-box evaluation protocol across multiple models, and introduces a new paradigm for multimodal safety research.
📝 Abstract
In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that image inputs constitute a key alignment vulnerability of MLLMs. Motivated by this finding, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent in the text input using meticulously crafted images. Experimental results show that HADES effectively jailbreaks existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% on LLaVA-1.5 and 71.60% on Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.
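The adversarial-image idea underlying such attacks can be sketched in miniature. The code below is an illustration only, not the HADES pipeline: the linear "refusal scorer", the array shapes, and the hyperparameters are all invented for the sketch. It shows the generic L-infinity-bounded, PGD-style perturbation loop that lowers a differentiable safety score while keeping the image within a small pixel budget; a real attack would instead differentiate through an MLLM's vision encoder.

```python
import numpy as np

# Toy stand-in for a differentiable "refusal" score: higher means the
# model is more likely to refuse. A real attack would backpropagate
# through the MLLM; a fixed linear scorer keeps this sketch self-contained.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 8, 8))              # scorer weights (hypothetical)
image = rng.uniform(0, 1, size=(3, 8, 8))   # benign image, pixels in [0, 1]

def refusal_score(x):
    """Linear toy scorer; its gradient w.r.t. x is simply w."""
    return float(np.sum(w * x))

def pgd_attack(x, eps=0.05, step=0.01, iters=20):
    """Minimize the refusal score under an L-infinity budget of eps."""
    adv = x.copy()
    for _ in range(iters):
        grad = w                              # exact gradient of the linear scorer
        adv = adv - step * np.sign(grad)      # signed-gradient descent step
        adv = np.clip(adv, x - eps, x + eps)  # project back into the eps-ball
        adv = np.clip(adv, 0.0, 1.0)          # keep pixels in a valid range
    return adv

adv = pgd_attack(image)
```

Each pixel can move at most `eps` from the original image, so the perturbation is visually small, yet the accumulated signed steps drive the scorer's output down.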