🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently suffer from object hallucination, generating text inconsistent with image content, because their attention is biased toward globally salient but prompt-irrelevant features. To address this, the paper proposes a training-free, plug-and-play mechanism that coordinates global and local attention. The method employs a dual-path attention assembly to decouple generative global features from discriminative local features; an image-prompt matching scheme drives dynamic enhancement of prompt-relevant regions and suppression of distracting ones; and multi-granularity attention fusion is combined with a calibrated logit distribution. Evaluated across multiple LVLMs, the approach significantly reduces object hallucination rates and generalizes well, improving visual grounding on diverse tasks including VQA, image captioning, and referring expression comprehension, all without model retraining or architectural modification.
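The region enhancement and interference suppression step can be illustrated with a minimal sketch. The names `augment_patches`, `match_scores`, and the `keep_ratio` knob are illustrative, not the paper's API; the sketch assumes a per-patch image-prompt matching score is already available and builds the augmented view by keeping high-scoring patches and zeroing out the rest.

```python
import numpy as np

def augment_patches(patches, match_scores, keep_ratio=0.5):
    """Build an augmented view of the image in patch-feature space.

    patches:      (N, D) array of N image-patch features
    match_scores: (N,) image-prompt matching score per patch
    keep_ratio:   fraction of patches to keep (assumed hyperparameter)

    Patches most relevant to the prompt are kept (enhanced), while
    low-scoring patches are masked to suppress distracting global context.
    """
    n_keep = max(1, int(len(match_scores) * keep_ratio))
    top = np.argsort(match_scores)[::-1][:n_keep]      # highest-scoring patches
    mask = np.zeros(len(match_scores), dtype=bool)
    mask[top] = True
    augmented = np.where(mask[:, None], patches, 0.0)  # zero out irrelevant regions
    return augmented, mask
```

A real implementation would operate on the model's visual tokens and may use soft weighting rather than hard masking; the hard top-k mask here is only a simplifying assumption.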
📝 Abstract
Despite great success across various multimodal tasks, Large Vision-Language Models (LVLMs) often produce object hallucinations, i.e., generated textual responses that are inconsistent with the actual objects in images. We examine different LVLMs and pinpoint one root cause of object hallucinations: deficient attention to discriminative image features. Specifically, LVLMs often predominantly attend to prompt-irrelevant global features instead of prompt-relevant local features, which undermines their visual grounding capacity and leads to object hallucinations. We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by simultaneously assembling global features for response generation and local features for visual discrimination. Specifically, we introduce an image-prompt matching scheme that captures prompt-relevant local features from images, yielding an augmented view of the input image in which prompt-relevant content is highlighted and irrelevant distractions are suppressed. Hallucinations can thus be mitigated with a calibrated logit distribution derived from the generative global features of the original image and the discriminative local features of the augmented image. Extensive experiments show the superiority of AGLA in LVLM hallucination mitigation, demonstrating its wide applicability across both discriminative and generative tasks. Our code is available at https://github.com/Lackel/AGLA.
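The final step, combining generative logits from the original image with discriminative logits from the augmented image, can be sketched as a simple weighted ensemble over next-token logits. This is a hedged illustration, not the paper's exact calibration: the additive fusion and the balancing weight `alpha` are assumptions here.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_distribution(logits_global, logits_local, alpha=1.0):
    """Fuse the two logit streams into one next-token distribution.

    logits_global: logits from the original image (generative, global view)
    logits_local:  logits from the augmented image (discriminative, local view)
    alpha:         assumed weight balancing the two streams

    Tokens supported by both the global context and the prompt-relevant
    local evidence are boosted relative to using either stream alone.
    """
    return softmax(logits_global + alpha * logits_local)
```

In a decoding loop, this distribution would replace the model's raw softmax at each generation step, so the method plugs in at inference time without touching model weights.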