Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently generate object hallucinations in image captioning due to excessive reliance on irrelevant visual tokens during autoregressive decoding. To address this, we propose an instruction-aligned visual attention mechanism that identifies and suppresses such spurious tokens by contrasting attention distributions across semantically distinct instructions—requiring no fine-tuning or auxiliary training. Our method dynamically evaluates token importance via contrastive decoding and applies logit reweighting to achieve fine-grained, instruction-driven hallucination suppression. Evaluated on MME, POPE, and TextVQA benchmarks, it significantly reduces object hallucination rates while outperforming existing decoding-time mitigation strategies. The approach is lightweight, plug-and-play, and fully compatible with frozen LVLMs. Code is publicly available.
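The summary above describes two decoding-time pieces: flagging instruction-irrelevant image tokens and reweighting the output logits. Below is a minimal sketch of the reweighting step, assuming a standard contrastive-decoding combination; the function name `contrastive_logits` and the strength parameter `alpha` are illustrative placeholders, not the paper's API.

```python
import torch

def contrastive_logits(logits_full: torch.Tensor,
                       logits_irrelevant: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Combine two next-token logit vectors at decoding time.

    logits_full:       logits conditioned on the full set of image tokens.
    logits_irrelevant: logits conditioned only on the tokens flagged as
                       irrelevant (assumed to come from a second forward pass).
    alpha:             contrast strength; alpha = 0 recovers ordinary decoding.
    """
    # Amplify the full-context prediction and subtract what the irrelevant
    # tokens alone would predict, suppressing outputs driven by spurious
    # visual evidence.
    return (1.0 + alpha) * logits_full - alpha * logits_irrelevant

# Toy usage: pick the next token greedily from the adjusted logits.
vocab_size = 32000
adjusted = contrastive_logits(torch.randn(vocab_size),
                              torch.randn(vocab_size), alpha=1.0)
next_token_id = int(torch.argmax(adjusted))
```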

📝 Abstract
Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. It has been reported that these models tend to over-focus on certain irrelevant image tokens that contain no critical information for answering the question, which distorts the output. To address this, we propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from the original image tokens and the irrelevant image tokens, reducing the model's over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
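As a rough illustration of the token-selection idea in the abstract (comparing attention weights under two different instructions), here is a hedged sketch. The selection rule, the `top_k` budget, and the choice of contrasting instruction are assumptions; the abstract does not specify them.

```python
import torch

def flag_irrelevant_tokens(attn_instruction: torch.Tensor,
                           attn_contrast: torch.Tensor,
                           top_k: int = 32) -> torch.Tensor:
    """Flag image tokens as candidates for suppression.

    attn_instruction: [num_image_tokens] attention mass each image token
                      receives under the actual instruction.
    attn_contrast:    the same quantity under a semantically distinct
                      (e.g. generic) instruction.
    """
    # Tokens whose attention barely changes between the two instructions
    # are instruction-agnostic: one plausible reading of "irrelevant".
    delta = (attn_instruction - attn_contrast).abs()
    idx = torch.topk(-delta, k=top_k).indices  # k smallest changes
    mask = torch.zeros_like(delta, dtype=torch.bool)
    mask[idx] = True
    return mask

# Toy usage with random attention maps over 576 image tokens.
mask = flag_irrelevant_tokens(torch.rand(576), torch.rand(576), top_k=32)
```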
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in Large Vision-Language Models
Reducing over-focus on irrelevant image tokens
Improving accuracy in object description tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-Aligned Visual Attention for LVLMs
Contrastive decoding adjusts logits from irrelevant image tokens (see the formulation after this list)
Reduces over-attention to non-critical image tokens
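For reference, the contrastive-decoding bullet can be written in the standard form used by decoding-time mitigation methods. The notation below is assumed, following common contrastive-decoding practice rather than equations quoted from the paper:

```latex
% Assumed notation: v = all image tokens, \tilde{v} = the flagged
% irrelevant subset, x = the instruction, y_t = candidate next token,
% \alpha \ge 0 = contrast strength, \ell_\theta = the model's logit.
p_\theta(y_t \mid v, x) \;\propto\;
  \exp\!\Big[(1+\alpha)\,\ell_\theta(y_t \mid v, x)
             \;-\; \alpha\,\ell_\theta(y_t \mid \tilde{v}, x)\Big]
```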
👥 Authors
Bin Li · School of Automation, Northwestern Polytechnical University, Xi'an, Shaanxi, China
Dehong Gao · School of Cybersecurity, Northwestern Polytechnical University, Xi'an, Shaanxi, China
Yeyuan Wang · School of Automation, Northwestern Polytechnical University
Linbo Jin · Alibaba Group
Shanqing Yu · Zhejiang University of Technology, Hangzhou, Zhejiang, China; Binjiang Institute of Artificial Intelligence, Hangzhou, Zhejiang, China
Xiaoyan Cai · Northwestern Polytechnical University
Libin Yang · University of Georgia

🔎 Similar Papers
2024-10-06 · Conference on Empirical Methods in Natural Language Processing · Citations: 33