🤖 AI Summary
Large vision-language models (LVLMs) suffer from object hallucination—generating text responses inconsistent with image content. Existing vision-token re-ranking methods mitigate this issue but exhibit limited generalizability, as they neglect architectural heterogeneity in attention mechanisms and spatial positional relationships across LVLMs. This work first reveals the architecture-dependent nature of the “attention–position correlation” in LVLMs. Building on this insight, we propose two attention calibration methods: (i) Uniform Attention Calibration (UAC), a training-free, plug-and-play approach; and (ii) Dynamic Attention Calibration (DAC), a differentiable method incorporating position-invariance constraints for fine-tuning. Both methods explicitly model and correct spatial biases in visual token attention, thereby enhancing cross-modal alignment robustness. Extensive evaluation across multiple benchmarks demonstrates substantial reductions in object hallucination rates and establishes new state-of-the-art performance across diverse LVLM architectures.
📝 Abstract
Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning capabilities but remain highly susceptible to object hallucination, where models generate responses that are not factually aligned with the visual content. Recent works attribute this issue to an inherent bias of LVLMs in which the vision-token attention map has a fixed correlation with spatial position, and propose to mitigate it by reordering visual tokens. However, we find that different LVLMs exhibit different correlations between attention and spatial position, which makes the existing solution difficult to generalize to other LVLMs. To address this issue, we first introduce a training-free solution, Uniform Attention Calibration (UAC), which estimates the bias from a single meaningless input image and applies a calibration matrix to rectify attention imbalances. To further alleviate the bias, we relax UAC's single-meaningless-input assumption and introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), which enforces consistent outputs regardless of where an object is located in the image via a plug-and-play module. Comprehensive experiments across multiple benchmarks demonstrate that UAC and DAC significantly reduce object hallucination while improving general multimodal alignment. Our methods achieve state-of-the-art performance across diverse LVLM architectures on various metrics.
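The UAC idea described above can be sketched in a few lines: attention paid to a content-free image should be uniform across visual-token positions, so any deviation estimates the positional bias, and dividing it out calibrates attention for real images. The sketch below is a minimal illustration under that assumption; the array shapes, the mean-based bias estimate, and the function names are hypothetical simplifications, not the paper's actual implementation.

```python
import numpy as np

def estimate_bias(attn_meaningless: np.ndarray) -> np.ndarray:
    """Estimate positional bias from attention over visual tokens of a
    meaningless image (e.g., uniform gray), shape (heads, n_visual_tokens).
    Since the image carries no content, variation across positions is
    attributed purely to positional bias."""
    bias = attn_meaningless.mean(axis=0)      # average over heads
    return bias / bias.mean()                 # normalize to mean 1

def calibrate(attn: np.ndarray, bias: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Divide out the positional bias and renormalize each attention row
    back to a probability distribution over visual tokens."""
    cal = attn / (bias + eps)
    return cal / cal.sum(axis=-1, keepdims=True)

# Usage: attention that merely mirrors the bias becomes uniform after calibration.
raw = np.array([[0.1, 0.2, 0.3, 0.4],
                [0.1, 0.2, 0.3, 0.4]])       # toy per-head attention, biased by position
bias = estimate_bias(raw)
attn = raw / raw.sum(axis=-1, keepdims=True)
calibrated = calibrate(attn, bias)           # ≈ 0.25 everywhere
```

In the actual methods, the correction is a calibration matrix applied inside the model's attention computation (and, for DAC, learned with a position-invariance objective) rather than a post-hoc rescaling, but the bias-then-divide intuition is the same.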