Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a security vulnerability in multimodal large language models (MLLMs) through a purely non-textual jailbreaking method: adversarial images or audio, optimized by the proposed Con Instruction approach, carry malicious instructions without any textual prompt. The adversarial examples are optimized to align closely with target instructions in the model's embedding space, and the method requires no training data and no preprocessing of textual instructions. To evaluate attacks more rigorously, the authors introduce the Attack Response Categorization (ARC) framework, which scores both the quality of a model's response and its relevance to the malicious instruction. Con Instruction bypasses safety mechanisms in multiple vision- and audio-language models (including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio), reaching attack success rates of 81.3% and 86.6% on LLaVA-v1.5 (13B) across the AdvBench and SafeBench benchmarks. An exploration of countermeasures reveals a substantial performance gap among existing defenses, and the implementation is publicly available.

📝 Abstract
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, we exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction. We optimize these adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental implications of MLLMs' sophisticated understanding. Unlike prior work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLM safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new Attack Response Categorization (ARC) framework, which evaluates both the quality of the model's response and its relevance to the malicious instructions. Experimental results demonstrate that Con Instruction effectively bypasses safety mechanisms in multiple vision- and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, evaluated on two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various countermeasures against our attacks and uncover a substantial performance gap among existing techniques. Our implementation is made publicly available.
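The core attack described in the abstract, optimizing a non-textual input so that its embedding aligns with a target instruction's embedding, can be illustrated with a minimal gradient-descent sketch. The linear "encoder" and the learning-rate/step choices below are toy assumptions for illustration, not the paper's actual MLLM encoders or optimization settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an image encoder (assumption): maps a flattened
# 16-"pixel" image into a shared 8-dim embedding space.
W_img = rng.normal(size=(8, 16))

# Embedding of the (malicious) target text instruction, here just a
# random vector standing in for the text encoder's output.
target_emb = rng.normal(size=8)

def embed_image(x):
    return W_img @ x

# Embedding-space alignment (sketch): adjust the adversarial image x so
# its embedding moves toward the target instruction embedding, by
# gradient descent on the squared embedding distance.
x = rng.normal(size=16)
lr = 0.02
for _ in range(1000):
    diff = embed_image(x) - target_emb   # embedding-space misalignment
    grad = W_img.T @ diff                # gradient of 0.5*||diff||^2 w.r.t. x
    x -= lr * grad

final_dist = np.linalg.norm(embed_image(x) - target_emb)
print(f"final embedding distance: {final_dist:.4f}")
```

In the real attack the encoder is the MLLM's vision or audio front end and gradients flow through it, but the loop structure, optimizing the raw input against an embedding-space alignment objective, is the same.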
Problem

Research questions and friction points this paper is trying to address.

Can MLLM safety mechanisms be bypassed using purely non-textual inputs, with no textual prompt at all?
How can adversarial images or audio be optimized to align with malicious instructions, without training data?
How should attack responses be judged for both quality and relevance, beyond simple refusal detection?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploits non-textual adversarial images and audio
Optimizes examples to align with target instructions
Introduces Attack Response Categorization (ARC) framework
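The ARC framework's stated purpose is to judge a response on two axes: whether the model produced a substantive answer, and whether that answer is actually relevant to the malicious instruction. A hypothetical sketch of that two-axis categorization (the label names and decision order here are illustrative assumptions, not the paper's exact scheme):

```python
def arc_categorize(quality: str, relevance: str) -> str:
    """Sketch of an ARC-style two-axis categorization (hypothetical labels).

    quality:   "refusal" or "answer" -- did the model give a substantive response?
    relevance: "on-topic" or "off-topic" -- does it address the malicious instruction?
    """
    if quality == "refusal":
        return "refused"               # safety mechanism held
    if relevance == "on-topic":
        return "successful_attack"     # substantive AND relevant: attack counts
    return "irrelevant_output"         # model complied, but not with the instruction

# Only the (answer, on-topic) cell counts as a successful attack.
print(arc_categorize("answer", "on-topic"))
```

The point of the second axis is that keyword-based success metrics over-count: a model that rambles off-topic has not actually been jailbroken, and a two-axis scheme separates those cases.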