Universal Adversarial Attack on Aligned Multimodal LLMs

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a fundamental vulnerability in the alignment mechanisms of multimodal large language models (MLLMs) on the visual input side. We propose the first image-level universal adversarial attack: a single optimized image suffices to bypass alignment safeguards across diverse MLLMs—including Qwen-VL, LLaVA, and InternVL—eliciting a chosen target phrase or harmful outputs under arbitrary textual queries, even in cross-model settings. Our approach uses gradient-driven, end-to-end joint optimization, backpropagating through both the visual encoder and the language head. To improve naturalness and transferability, we introduce multi-model collaborative training and multi-answer generation. Evaluated on SafeBench, the attack achieves up to a 93% success rate—substantially outperforming text-based universal prompting baselines—and represents the first image-level universal attack that generalizes across models and tasks. Code and dataset are publicly released.
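The joint optimization the summary describes can be sketched as follows: a single image tensor is optimized by gradient descent to minimize the cross-entropy of a fixed target phrase, averaged over many text queries and over several models at once (the paper's multi-model collaborative training). The `ToyMLLM` module, vocabulary size, and query embeddings below are stand-ins invented for illustration — the actual attack backpropagates through real MLLMs such as Qwen-VL or LLaVA, not these toy layers.

```python
# Hedged sketch of a universal adversarial image attack, assuming toy
# stand-in modules in place of a real multimodal LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 100, 32  # toy vocabulary and embedding size (assumptions)

class ToyMLLM(nn.Module):
    """Stand-in for an MLLM: a 'vision encoder' plus a 'language head'."""
    def __init__(self):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, DIM))
        self.head = nn.Linear(DIM, VOCAB)  # next-token logits

    def forward(self, image, query_emb):
        feats = self.vision(image) + query_emb  # crude image/query fusion
        return self.head(feats)

# Multi-model collaborative training: optimize against several models jointly.
models = [ToyMLLM() for _ in range(2)]
# Stand-ins for embeddings of diverse textual queries.
queries = [torch.randn(1, DIM) for _ in range(4)]
# Stand-in token id for a target phrase like "Sure, here it is".
target = torch.tensor([7])

adv_image = torch.zeros(1, 3, 8, 8, requires_grad=True)  # the universal image
opt = torch.optim.Adam([adv_image], lr=0.1)

for step in range(200):
    opt.zero_grad()
    # Sum the target-phrase loss over every (model, query) pair so one
    # image must work universally.
    loss = sum(F.cross_entropy(m(adv_image, q), target)
               for m in models for q in queries)
    loss.backward()  # gradients flow through vision encoder and language head
    opt.step()
```

In the real setting the loss would instead be the language-model loss of the full target string given the image and each harmful prompt, and the image would typically be clamped to a valid pixel range after each step.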

📝 Abstract
We propose a universal adversarial attack on multimodal Large Language Models (LLMs) that leverages a single optimized image to override alignment safeguards across diverse queries and even multiple models. By backpropagating through the vision encoder and language head, we craft a synthetic image that forces the model to respond with a targeted phrase (e.g., "Sure, here it is") or otherwise unsafe content, even for harmful prompts. In experiments on the SafeBench benchmark, our method achieves significantly higher attack success rates than existing baselines, including text-only universal prompts (e.g., up to 93% on certain models). We further demonstrate cross-model transferability by training on several multimodal LLMs simultaneously and testing on unseen architectures. Additionally, a multi-answer variant of our approach produces more natural-sounding (yet still malicious) responses. These findings underscore critical vulnerabilities in current multimodal alignment and call for more robust adversarial defenses. We will release code and datasets under the Apache-2.0 license. Warning: some content generated by Multimodal LLMs in this paper may be offensive to some readers.
Problem

Research questions and friction points this paper is trying to address.

Develops a universal adversarial attack on multimodal LLMs.
Overrides safety measures with a single optimized image.
Highlights vulnerabilities in multimodal model alignment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single optimized image attacks all queries
Backpropagation through vision encoder and language head
Cross-model transferability demonstrated