🤖 AI Summary
This work exposes a fundamental vulnerability on the visual input side of the alignment mechanisms of multimodal large language models (MLLMs). We propose the first image-level universal adversarial attack: a single optimized image suffices to bypass alignment safeguards across diverse MLLMs (including Qwen-VL, LLaVA, and InternVL), eliciting target phrases or harmful outputs regardless of the textual query, even in cross-model settings. The attack uses gradient-driven, end-to-end joint optimization, backpropagating through both the visual encoder and the language head. To improve naturalness and transferability, we introduce multi-model collaborative training and multi-answer generation. Evaluated on SafeBench, the attack achieves up to a 93% success rate, substantially outperforming text-based universal prompting baselines, and is the first image-level universal attack that generalizes across models and tasks. Code and dataset are publicly released.
📝 Abstract
We propose a universal adversarial attack on multimodal large language models (LLMs) that leverages a single optimized image to override alignment safeguards across diverse queries and even multiple models. By backpropagating through the vision encoder and language head, we craft a synthetic image that forces the model to respond with a targeted phrase (e.g., "Sure, here it is") or otherwise unsafe content, even for harmful prompts. In experiments on the SafeBench benchmark, our method achieves significantly higher attack success rates than existing baselines, including text-only universal prompts (e.g., up to 93% on certain models). We further demonstrate cross-model transferability by training on several multimodal LLMs simultaneously and testing on unseen architectures. Additionally, a multi-answer variant of our approach produces more natural-sounding (yet still malicious) responses. These findings underscore critical vulnerabilities in current multimodal alignment and call for more robust adversarial defenses. We will release code and datasets under the Apache-2.0 license. Warning: some content generated by multimodal LLMs in this paper may be offensive to some readers.
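The core mechanism described above (optimizing a single image by gradient descent so that a frozen model assigns high probability to a target output) can be sketched in miniature. The snippet below is a toy illustration, not the paper's implementation: a random linear map stands in for the frozen vision encoder and language head, the "target phrase" is reduced to a single hypothetical target class, and projected gradient descent on the pixels (clipped to the valid [0, 1] range) minimizes the cross-entropy loss for that class.

```python
import math
import random

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def craft_adversarial_image(W, image, target, steps=3000, lr=0.005):
    """Projected gradient descent on the image pixels.

    W is a toy stand-in for the frozen model (vision encoder + language
    head collapsed into one linear map); `target` is a hypothetical
    target class playing the role of the attacker's target phrase.
    """
    x = list(image)
    n_classes, n_pixels = len(W), len(x)
    for _ in range(steps):
        # Forward pass: logits of the surrogate "model".
        logits = [sum(W[c][i] * x[i] for i in range(n_pixels))
                  for c in range(n_classes)]
        p = softmax(logits)
        # Cross-entropy gradient w.r.t. logits is (p - one_hot(target)).
        err = [p[c] - (1.0 if c == target else 0.0)
               for c in range(n_classes)]
        # Backpropagate through the linear map and take a clipped step,
        # keeping every pixel inside the valid image range [0, 1].
        for i in range(n_pixels):
            g = sum(W[c][i] * err[c] for c in range(n_classes))
            x[i] = min(1.0, max(0.0, x[i] - lr * g))
    return x

random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(10)]
img = [random.random() for _ in range(64)]   # benign starting image
adv = craft_adversarial_image(W, img, target=3)
```

In the real attack, the linear map is replaced by the full MLLM, the single class by the token sequence of the target phrase, and the loss is averaged over many textual queries (and, for transferability, over several models), but the optimization loop has the same shape.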