🤖 AI Summary
This work identifies the image modality as a critical vulnerability, an "Achilles' heel," of multimodal large language models (MLLMs): adversarial images can break vision-language safety alignment and enable jailbreaking.
Method: We propose HADES, the first vision-induced jailbreaking framework, which combines adversarial image generation, cross-modal semantic coupling, and attention-driven amplification of harmful intent to covertly hide and magnify the harmful intent of a text prompt inside crafted images.
Contribution/Results: Evaluated on LLaVA-1.5 and Gemini Pro Vision, HADES achieves average attack success rates of 90.26% and 71.60%, respectively, substantially outperforming existing text-only jailbreaking methods. The study is the first to systematically demonstrate the fragility of image-side alignment in MLLMs, establishes a unified white-box/black-box evaluation protocol across multiple models, and introduces a new paradigm for multimodal safety research.
📝 Abstract
In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that image inputs constitute a key alignment vulnerability of MLLMs. Motivated by this finding, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent in the text input using meticulously crafted images. Experimental results show that HADES effectively jailbreaks existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% on LLaVA-1.5 and 71.60% on Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.
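The adversarial-image idea underlying such attacks can be sketched in miniature. The code below is an illustration only, not the HADES pipeline: the linear "refusal scorer", the array shapes, and the hyperparameters are all invented for the sketch. It shows the generic L-infinity-bounded, PGD-style perturbation loop that lowers a differentiable safety score while keeping the image within a small pixel budget; a real attack would instead differentiate through an MLLM's vision encoder.

```python
import numpy as np

# Toy stand-in for a differentiable "refusal" score: higher means the
# model is more likely to refuse. A real attack would backpropagate
# through the MLLM; a fixed linear scorer keeps this sketch self-contained.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 8, 8))              # scorer weights (hypothetical)
image = rng.uniform(0, 1, size=(3, 8, 8))   # benign image, pixels in [0, 1]

def refusal_score(x):
    """Linear toy scorer; its gradient w.r.t. x is simply w."""
    return float(np.sum(w * x))

def pgd_attack(x, eps=0.05, step=0.01, iters=20):
    """Minimize the refusal score under an L-infinity budget of eps."""
    adv = x.copy()
    for _ in range(iters):
        grad = w                              # exact gradient of the linear scorer
        adv = adv - step * np.sign(grad)      # signed-gradient descent step
        adv = np.clip(adv, x - eps, x + eps)  # project back into the eps-ball
        adv = np.clip(adv, 0.0, 1.0)          # keep pixels in a valid range
    return adv

adv = pgd_attack(image)
```

Each pixel can move at most `eps` from the original image, so the perturbation is visually small, yet the accumulated signed steps drive the scorer's output down.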