🤖 AI Summary
This work identifies a novel cause of hallucination in large vision-language models (LVLMs): certain instruction tokens hijack visual attention during decoding, steering it toward less discriminative image regions and suppressing perception of broader image context, a phenomenon the authors term "Attention Hijacking". To address this, they propose Attention HIjackers Detection and Disentanglement (AID), a training-free method that (i) detects hijacker tokens via instruction-driven visual salience, (ii) masks the visual attention of detected hijackers to shield subsequent tokens from their influence, and (iii) rebalances instruction-driven and image-driven salience to avoid over-masking and restore faithful visual grounding. Evaluated across multiple LVLMs and benchmarks, AID significantly reduces hallucination while incurring zero training overhead.
📝 Abstract
Despite their success, Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations. While existing studies attribute the cause of hallucinations to insufficient visual attention to image tokens, our findings indicate that hallucinations also arise from interference from instruction tokens during decoding. Intuitively, certain instruction tokens continuously distort LVLMs' visual perception during decoding, hijacking their visual attention toward less discriminative visual regions. This distortion prevents them from integrating broader contextual information from images, ultimately leading to hallucinations. We term this phenomenon 'Attention Hijacking', where disruptive instruction tokens act as 'Attention Hijackers'. To address this, we propose a novel, training-free strategy, namely Attention HIjackers Detection and Disentanglement (AID), designed to isolate the influence of Hijackers and enable LVLMs to rely on their context-aware intrinsic attention maps. Specifically, AID consists of three components. First, Attention Hijackers Detection identifies Attention Hijackers by calculating instruction-driven visual salience. Next, an Attention Disentanglement mechanism masks the visual attention of these identified Hijackers, thereby mitigating their disruptive influence on subsequent tokens. Finally, Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects. Extensive experiments demonstrate that AID significantly reduces hallucination across various LVLMs on several benchmarks.
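The three-stage pipeline described above can be sketched in a toy form. This is an illustrative reconstruction, not the paper's implementation: the KL-divergence criterion for flagging hijackers, the `top_k` cutoff, the mixing weight `alpha`, and all function names are assumptions chosen to make the detect → mask → rebalance flow concrete on NumPy arrays.

```python
import numpy as np

def detect_hijackers(instr_attn, img_salience, top_k=1):
    # Hypothetical detection rule: flag instruction tokens whose visual
    # attention distribution diverges most (KL) from image-driven salience.
    # instr_attn: (num_instr_tokens, num_patches), rows sum to 1.
    eps = 1e-12
    kl = np.sum(instr_attn * np.log((instr_attn + eps) / (img_salience + eps)), axis=1)
    return np.argsort(kl)[-top_k:]

def disentangle(instr_attn, hijackers):
    # Mask the visual attention of identified hijackers so they no longer
    # steer subsequent tokens' attention.
    masked = instr_attn.copy()
    masked[hijackers] = 0.0
    return masked

def re_disentangle(masked_attn, img_salience, alpha=0.5):
    # Rebalance instruction-driven and image-driven salience to avoid
    # over-masking; alpha is an assumed mixing weight.
    kept = masked_attn[masked_attn.sum(axis=1) > 0]
    instr_salience = kept.mean(axis=0) if len(kept) else img_salience
    mix = alpha * instr_salience + (1.0 - alpha) * img_salience
    return mix / mix.sum()

if __name__ == "__main__":
    # Two broadly attending instruction tokens and one that fixates on a
    # single patch (the hijacker); image-driven salience is uniform here.
    instr_attn = np.array([[0.25, 0.25, 0.25, 0.25],
                           [0.30, 0.20, 0.30, 0.20],
                           [0.97, 0.01, 0.01, 0.01]])
    img_salience = np.array([0.25, 0.25, 0.25, 0.25])

    hijackers = detect_hijackers(instr_attn, img_salience)
    masked = disentangle(instr_attn, hijackers)
    salience = re_disentangle(masked, img_salience)
    print(hijackers, salience)
```

Running the toy example flags token 2 (the fixated one) as the hijacker, zeroes its row, and returns a normalized salience map blended from the remaining tokens and the image prior.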