DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

📅 2024-10-06
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 33
Influential: 5
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive object hallucination, largely because vision encoder–guided attention over-emphasizes background regions rather than the referred objects. To address this, we propose DAMRO, a training-free attention reweighting strategy that leverages the ViT classification token (CLS token) to identify high-attention background outlier tokens and suppress them at the source of the attention mechanism. DAMRO operates during decoding as a lightweight, plug-and-play intervention, requiring no architectural modification or fine-tuning. Evaluated on mainstream LVLMs, including LLaVA-1.5, LLaVA-NeXT, and InstructBLIP, DAMRO consistently reduces hallucination across the POPE, CHAIR, and MME benchmarks as well as GPT-4V-aided evaluation. The results demonstrate significant and generalizable improvements, confirming DAMRO's effectiveness in mitigating object hallucination without compromising model versatility or inference efficiency.

📝 Abstract
Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute the unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to over-emphasize redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that **D**ives into the **A**ttention **M**echanism of LVLMs to **R**educe **O**bject Hallucination. Specifically, our approach employs the classification token (CLS) of ViT to filter out high-attention tokens scattered in the background and then eliminates their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs.
Problem

Research questions and friction points this paper is trying to address.

LVLMs suffer from object hallucination due to attention flaws
Visual encoder misguides LLM to focus on background tokens
DAMRO reduces hallucination by filtering outlier attention tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Filters outlier tokens using ViT CLS attention
Eliminates background token influence during decoding
Training-free method reduces object hallucination in LVLMs
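The two-step idea above — select outlier image tokens by their ViT CLS-token attention, then suppress them when the LLM decoder attends over image tokens — can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: the function names, the fixed top-k selection rule, and the hard masking via `-inf` are illustrative assumptions (DAMRO operates inside the model's actual attention layers during decoding).

```python
import numpy as np

def select_outlier_tokens(cls_attn, top_k=10):
    """Return indices of the top_k image tokens receiving the highest
    attention from the ViT CLS token (candidate background outliers).
    cls_attn: 1-D array, CLS->image-token attention weights."""
    return np.argsort(cls_attn)[::-1][:top_k]

def suppress_and_renormalize(attn_logits, outlier_idx):
    """Mask outlier image tokens in the decoder's attention logits
    before the softmax, then renormalize over the remaining tokens.
    attn_logits: (..., num_image_tokens) pre-softmax scores."""
    masked = attn_logits.copy()
    masked[..., outlier_idx] = -np.inf  # hard suppression (illustrative)
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

In this toy form, tokens that dominate the CLS attention are zeroed out of the decoder's attention distribution, so the remaining probability mass shifts toward object-relevant tokens.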
Xuan Gong
Department of Computer Science and Technology, Tongji University
Tianshi Ming
Department of Computer Science and Technology, Tongji University
Xinpeng Wang
Department of Computer Science and Technology, Tongji University
Zhihua Wei
Department of Computer Science and Technology, Tongji University