Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

📅 2024-03-14
🏛️ European Conference on Computer Vision
📈 Citations: 17
Influential: 4
🤖 AI Summary
This work identifies the image modality as the "Achilles' heel" of alignment in multimodal large language models (MLLMs): carefully crafted images can bypass harmlessness alignment and trigger jailbreaking. Method: The authors propose HADES, a vision-induced jailbreak framework that hides and amplifies the harmfulness of the malicious intent within the text input using meticulously crafted images. Contribution/Results: Evaluated on LLaVA-1.5 and Gemini Pro Vision, HADES achieves average Attack Success Rates of 90.26% and 71.60%, respectively, substantially outperforming text-only jailbreaking baselines. The study also provides a systematic empirical analysis showing that the image input is a key alignment vulnerability of MLLMs.

📝 Abstract
In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that the image input poses the alignment vulnerability of MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input, using meticulously crafted images. Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.
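The Attack Success Rate (ASR) cited above is, in standard jailbreak evaluations, the fraction of harmful prompts for which the model produces a non-refusal (harmful) response. A minimal sketch of how such a metric is typically computed; the per-prompt judgment data here is hypothetical, and the paper's own evaluation pipeline may differ:

```python
def attack_success_rate(judgments: list[bool]) -> float:
    """ASR = (# prompts judged jailbroken) / (# prompts attempted)."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)

# Hypothetical judge outcomes, one per attacked prompt
# (True = the model produced a harmful, non-refusal response).
judgments = [True, True, False, True]
print(f"ASR: {attack_success_rate(judgments):.2%}")  # → ASR: 75.00%
```

An ASR of 90.26% on LLaVA-1.5 thus means roughly nine out of ten adversarial prompt-image pairs elicited a harmful response under the paper's judging criterion.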
Problem

Research questions and friction points this paper is trying to address.

AI Model
Misalignment Issue
Text-Image Integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

HADES
Multi-modal Language Models
Adversarial Attacks
Yifan Li
Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods
Hangyu Guo
Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods
Kun Zhou
School of Information, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods
Wayne Xin Zhao
Professor, Renmin University of China
Recommender System · Natural Language Processing · Large Language Model
Ji-Rong Wen
Gaoling School of Artificial Intelligence, Renmin University of China
Large Language Model · Web Search · Information Retrieval · Machine Learning