Distraction is All You Need for Multimodal Large Language Model Jailbreaking

📅 2025-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work uncovers a security vulnerability in the image-text alignment mechanism of multimodal large language models (MLLMs): the visual complexity, not the semantic content, of image subregions can disrupt alignment and thereby compromise safety guardrails. Building on this "Distraction Hypothesis", the authors propose CS-DJ, a dual-path jailbreaking framework that combines structured distraction (query decomposition that fragments a harmful prompt into sub-queries) with visual-enhanced distraction (construction of contrasting subimages). The work is, to the authors' knowledge, the first to identify and exploit subimage visual complexity as an attack dimension. Evaluated on four major closed-source MLLMs (GPT-4o-mini, GPT-4o, GPT-4V, and Gemini-1.5-Flash) across five representative safety-critical scenarios, CS-DJ achieves an average attack success rate of 52.40%, rising to 74.10% in the ensemble attack setting, substantially outperforming state-of-the-art methods.
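
The claim that subimage visual complexity, rather than semantic content, drives the disruption suggests a complexity proxy that is cheap to compute. The sketch below scores candidate subimages by edge density; this proxy, the OpenCV Canny thresholds, and the selection logic are illustrative assumptions, not the paper's actual measure.

```python
import cv2
import numpy as np

def edge_density(img: np.ndarray) -> float:
    """Fraction of edge pixels: a crude proxy for visual complexity."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # thresholds chosen for illustration
    return float(np.count_nonzero(edges)) / edges.size

def pick_most_complex(subimages: list, k: int = 4) -> list:
    """Keep the k visually busiest subimages, regardless of what they depict."""
    return sorted(subimages, key=edge_density, reverse=True)[:k]
```

Edge density is only one plausible proxy; compressed file size or pixel entropy would serve the same illustrative purpose.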

📝 Abstract
Multimodal Large Language Models (MLLMs) bridge the gap between visual and textual data, enabling a range of advanced applications. However, complex internal interactions among visual elements and their alignment with text can introduce vulnerabilities, which may be exploited to bypass safety mechanisms. To investigate this, we analyze the relationship between image content and the task and find that the complexity of subimages, rather than their content, is key. Building on this insight, we propose the Distraction Hypothesis, followed by a novel framework called Contrasting Subimage Distraction Jailbreaking (CS-DJ), which achieves jailbreaking by disrupting MLLMs' alignment through multi-level distraction strategies. CS-DJ consists of two components: structured distraction, achieved through query decomposition that induces a distributional shift by fragmenting harmful prompts into sub-queries, and visual-enhanced distraction, realized by constructing contrasting subimages to disrupt the interactions among visual elements within the model. This dual strategy disperses the model's attention, reducing its ability to detect and mitigate harmful content. Extensive experiments across five representative scenarios and four popular closed-source MLLMs (GPT-4o-mini, GPT-4o, GPT-4V, and Gemini-1.5-Flash) demonstrate that CS-DJ achieves an average attack success rate of 52.40% and an ensemble attack success rate of 74.10%. These results reveal the potential of distraction-based approaches to exploit and bypass MLLMs' defenses, offering new insights for attack strategies.
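
To make the two components concrete, here is a minimal sketch of how a CS-DJ-style input could be assembled: a prompt is fragmented into sub-queries, and contrasting subimages are tiled into one composite image that is sent alongside them. The function names and the naive word chunking are hypothetical stand-ins; the paper's actual decomposition and subimage construction are more sophisticated.

```python
from PIL import Image

def decompose_query(query: str, n: int = 3) -> list:
    """Hypothetical stand-in: fragment a prompt into roughly n sub-queries.
    Naive word chunking shown here; the paper's decomposition is designed
    to induce a distributional shift, not merely to split text."""
    words = query.split()
    step = max(1, len(words) // n)
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

def tile_contrasting_subimages(subimages, cell=224):
    """Tile mutually contrasting subimages into one canvas so that their
    interactions compete for the model's visual attention."""
    canvas = Image.new("RGB", (cell * len(subimages), cell))
    for i, img in enumerate(subimages):
        canvas.paste(img.resize((cell, cell)), (i * cell, 0))
    return canvas

# Usage: pair the composite image with the sub-queries in a single
# multimodal request.
# sub_queries = decompose_query(prompt)
# composite = tile_contrasting_subimages(chosen_subimages)
```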
Problem

Research questions and friction points this paper is trying to address.

Exploit vulnerabilities in Multimodal Large Language Models
Disrupt alignment between visual and textual data
Bypass safety mechanisms via distraction strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level distraction strategies disrupt image-text alignment
Contrasting subimages disrupt visual interactions
Query decomposition induces distributional shift
👥 Authors
Zuopeng Yang, Shanghai Jiao Tong University (Generative Model, Diffusion Model, AIGC)
Jiluan Fan, Guangzhou University
Anli Yan, Guangzhou University
Erdun Gao, University of Adelaide (Causal Inference)
Xin Lin, Guangzhou University
Tao Li, Shanghai Jiao Tong University
Changyu Dong, Guangzhou University (Security, Privacy, Applied Cryptography, AI Security)