Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination: they generate descriptions that include nonexistent objects or misrepresent ones actually present in the image. To address this, the paper proposes Ensemble Decoding (ED), a training-free decoding strategy that splits the input image into sub-images and combines their logit distributions, weighting each sub-image by the model's attention map. An ED adaptive plausibility constraint calibrates the fused logit distribution, and FastED, a lighter-weight variant, targets speed-critical applications. Across hallucination benchmarks, ED achieves state-of-the-art performance, reducing object hallucination while preserving task accuracy.

📝 Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce an ED adaptive plausibility constraint to calibrate the logit distribution, and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Mitigating object hallucination in Large Vision-Language Models
Improving accuracy of visual content descriptions in LVLMs
Addressing scalability in hallucination reduction methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble Decoding splits images into sub-images
Uses attention maps to weight logit distributions
Introduces an ED adaptive plausibility constraint to calibrate the fused logit distribution
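The fusion described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the array shapes, and the idea of passing pre-computed attention weights are assumptions, and the plausibility constraint is rendered in the common form of masking tokens whose probability falls below a fraction `alpha` of the top token's probability.

```python
import numpy as np

def ensemble_decode_step(sub_image_logits, attention_weights, alpha=0.1):
    """One decoding step of an Ensemble-Decoding-style fusion (sketch).

    sub_image_logits: (K, V) next-token logits, one row per sub-image view
    attention_weights: (K,) relevance of each sub-image, assumed to be
        derived from the model's attention map (hypothetical input here)
    alpha: cutoff for the adaptive plausibility constraint
    """
    # Normalize attention weights and fuse per-sub-image logits.
    w = attention_weights / attention_weights.sum()
    fused = (w[:, None] * sub_image_logits).sum(axis=0)

    # Softmax over the vocabulary (numerically stabilized).
    probs = np.exp(fused - fused.max())
    probs /= probs.sum()

    # Adaptive plausibility constraint: zero out tokens whose probability
    # is far below the most likely token, then renormalize.
    probs = np.where(probs >= alpha * probs.max(), probs, 0.0)
    return probs / probs.sum()
```

For example, with two sub-image views that agree on the plausible tokens, the fused distribution keeps those tokens and discards vocabulary entries whose probability falls below the `alpha`-scaled maximum.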