Mirage in the Eyes: Hallucination Attack on Multi-modal Large Language Models with Only Attention Sink

📅 2025-01-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies “attention sink”—an attention concentration phenomenon in multimodal large language models (MLLMs)—as a critical mechanism underlying hallucination. Building on this insight, we propose the first black-box hallucination attack solely leveraging attention sinks: by crafting visually adversarial inputs with low image-text semantic alignment, we dynamically induce excessive attention concentration at critical positions, thereby eliciting high-quality yet factually incorrect text generation. Our method requires no access to gradients or model parameters; it integrates attention analysis, instruction-tuning vulnerability modeling, and visual input optimization, ensuring strong cross-model transferability and response fidelity. We validate its effectiveness across six open-source MLLMs and commercial APIs including GPT-4o and Gemini 1.5, demonstrating significant evasion of state-of-the-art defenses. The implementation is publicly released.
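The "attention sink" the summary describes is excessive attention mass piling onto a few key positions (often the first token). As a minimal, hypothetical sketch of how such concentration could be quantified, the snippet below computes the average fraction of each query's attention that lands on designated sink positions; `sink_score` is an illustrative name, not an artifact from the paper.

```python
import numpy as np

def sink_score(attn, sink_positions):
    """Average fraction of attention mass that lands on 'sink' positions.

    attn: (num_queries, num_keys) row-stochastic attention matrix,
          e.g., one head's attention weights for a generation step.
    sink_positions: indices of candidate sink tokens (e.g., the BOS token).
    """
    per_query_mass = attn[:, sink_positions].sum(axis=1)
    return float(per_query_mass.mean())

# Toy example: 4 queries over 6 keys, with attention piled onto position 0.
rng = np.random.default_rng(0)
attn = rng.random((4, 6))
attn[:, 0] += 5.0                        # exaggerate mass on the sink token
attn /= attn.sum(axis=1, keepdims=True)  # renormalize rows to sum to 1

score = sink_score(attn, sink_positions=[0])
```

In this toy setup the score is close to 1, signalling heavy concentration; an attack along the lines the summary sketches would try to push this kind of statistic up at critical positions via the visual input alone.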

📝 Abstract
Fusing visual understanding into language generation, Multi-modal Large Language Models (MLLMs) are revolutionizing visual-language applications. Yet, these models are often plagued by the hallucination problem, which involves generating inaccurate objects, attributes, and relationships that do not match the visual content. In this work, we delve into the internal attention mechanisms of MLLMs to reveal the underlying causes of hallucination, exposing the inherent vulnerabilities in the instruction-tuning process. We propose a novel hallucination attack against MLLMs that exploits attention sink behaviors to trigger hallucinated content with minimal image-text relevance, posing a significant threat to critical downstream applications. Distinguished from previous adversarial methods that rely on fixed patterns, our approach generates dynamic, effective, and highly transferable visual adversarial inputs, without sacrificing the quality of model responses. Comprehensive experiments on 6 prominent MLLMs demonstrate the efficacy of our attack in compromising black-box MLLMs even with extensive mitigating mechanisms, as well as the promising results against cutting-edge commercial APIs, such as GPT-4o and Gemini 1.5. Our code is available at https://huggingface.co/RachelHGF/Mirage-in-the-Eyes.
Problem

Research questions and friction points this paper is trying to address.

Adversarial Attacks
Multimodal Information Processing
Attention Manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention Trap
Multi-modal Model Attack
Flexible and Efficient Misleading