🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently suffer from object hallucination due to insufficient vision–language alignment. To address this, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play attention intervention method. We first observe that caption queries substantially enhance visual attention; leveraging this insight, CAI dynamically recalibrates visual feature attention weights during inference to improve cross-modal alignment. Crucially, CAI operates solely via forward-pass attention response analysis—requiring no parameter updates or architectural modifications—and incurs negligible computational overhead. Evaluated across four diverse benchmarks spanning discriminative and generative tasks, CAI achieves state-of-the-art performance with minimal inference cost, significantly outperforming existing approaches that rely on fine-tuning or entail substantial computational overhead.
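The attention recalibration described above can be sketched in a minimal form. The sketch below only illustrates the general idea of boosting attention on visual tokens and renormalizing; the function name, the fixed `boost` factor, and the masking scheme are illustrative assumptions — the paper derives its intervention from caption-query attention responses rather than a constant.

```python
import numpy as np

def intervene_attention(attn_weights, visual_token_mask, boost=1.5):
    """Hedged sketch of a CAI-style intervention on one attention map.

    attn_weights: (num_queries, num_keys) row-stochastic softmax attention.
    visual_token_mask: boolean mask over keys marking visual tokens.
    boost: hypothetical scaling factor; the actual method calibrates this
           from caption-query attention patterns, not a fixed constant.
    """
    scaled = attn_weights.copy()
    # Upweight attention paid to visual tokens...
    scaled[:, visual_token_mask] *= boost
    # ...then renormalize each row so the weights still sum to 1.
    scaled /= scaled.sum(axis=1, keepdims=True)
    return scaled
```

Because the intervention only rescales an existing forward-pass attention map, it requires no parameter updates and adds negligible cost, consistent with the training-free, plug-and-play framing above.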
📝 Abstract
Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from the visual input, resulting in object hallucination. Existing mitigation methods mostly depend on expensive manual annotation and training, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries than non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern elicited by caption queries to enhance LVLMs' visual perception capability. Extensive experiments across four benchmarks covering both discriminative and generative tasks demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigation with only minimal additional inference cost.