🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently suffer from object hallucination, generating text inconsistent with image content, because their attention is biased toward globally salient but prompt-irrelevant features. To address this, the paper proposes a training-free, plug-and-play mechanism that coordinates global and local attention. The method employs a dual-path attention assembly to decouple generative global features from discriminative local features; an image-prompt matching scheme drives dynamic enhancement of prompt-relevant regions and suppression of distracting ones; and multi-granularity attention fusion is combined with a calibrated logit distribution. Evaluated across multiple LVLMs, the approach significantly reduces object hallucination rates and generalizes well, improving visual grounding on diverse tasks including VQA, image captioning, and referring expression comprehension, all without model retraining or architectural modification.
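The region enhancement and interference suppression step can be illustrated with a minimal sketch. The names `augment_patches`, `match_scores`, and the `keep_ratio` knob are illustrative, not the paper's API; the sketch assumes a per-patch image-prompt matching score is already available and builds the augmented view by keeping high-scoring patches and zeroing out the rest.

```python
import numpy as np

def augment_patches(patches, match_scores, keep_ratio=0.5):
    """Build an augmented view of the image in patch-feature space.

    patches:      (N, D) array of N image-patch features
    match_scores: (N,) image-prompt matching score per patch
    keep_ratio:   fraction of patches to keep (assumed hyperparameter)

    Patches most relevant to the prompt are kept (enhanced), while
    low-scoring patches are masked to suppress distracting global context.
    """
    n_keep = max(1, int(len(match_scores) * keep_ratio))
    top = np.argsort(match_scores)[::-1][:n_keep]      # highest-scoring patches
    mask = np.zeros(len(match_scores), dtype=bool)
    mask[top] = True
    augmented = np.where(mask[:, None], patches, 0.0)  # zero out irrelevant regions
    return augmented, mask
```

A real implementation would operate on the model's visual tokens and may use soft weighting rather than hard masking; the hard top-k mask here is only a simplifying assumption.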
📝 Abstract
Despite great success across various multimodal tasks, Large Vision-Language Models (LVLMs) often produce object hallucinations, i.e., generated textual responses that are inconsistent with the actual objects in images. We examine different LVLMs and pinpoint one root cause of object hallucinations: deficient attention to discriminative image features. Specifically, LVLMs often predominantly attend to prompt-irrelevant global features instead of prompt-relevant local features, which undermines their visual grounding capacity and leads to object hallucinations. We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by simultaneously assembling global features for response generation and local features for visual discrimination. Specifically, we introduce an image-prompt matching scheme that captures prompt-relevant local features from images, yielding an augmented view of the input image in which prompt-relevant content is highlighted and irrelevant distractions are suppressed. Hallucinations can thus be mitigated with a calibrated logit distribution derived from the generative global features of the original image and the discriminative local features of the augmented image. Extensive experiments show the superiority of AGLA in LVLM hallucination mitigation, demonstrating its wide applicability across both discriminative and generative tasks. Our code is available at https://github.com/Lackel/AGLA.
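The final step, combining generative logits from the original image with discriminative logits from the augmented image, can be sketched as a simple weighted ensemble over next-token logits. This is a hedged illustration, not the paper's exact calibration: the additive fusion and the balancing weight `alpha` are assumptions here.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_distribution(logits_global, logits_local, alpha=1.0):
    """Fuse the two logit streams into one next-token distribution.

    logits_global: logits from the original image (generative, global view)
    logits_local:  logits from the augmented image (discriminative, local view)
    alpha:         assumed weight balancing the two streams

    Tokens supported by both the global context and the prompt-relevant
    local evidence are boosted relative to using either stream alone.
    """
    return softmax(logits_global + alpha * logits_local)
```

In a decoding loop, this distribution would replace the model's raw softmax at each generation step, so the method plugs in at inference time without touching model weights.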