🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination: generating content inconsistent with visual inputs. Existing approaches typically modulate visual or textual attention in isolation, neglecting their causal interplay. This paper proposes a causality-driven dual-path attention intervention framework. First, we construct a causal graph to model the hallucination generation process. Second, we define the Visual-to-Textual Attention Contribution Ratio (VTACR) to quantify cross-modal imbalance. Third, we introduce fine-grained token-level and layer-wise attention reweighting, coupled with dual-path contrastive decoding that jointly optimizes visual grounding and hallucination suppression. Evaluated on the POPE and CHAIR benchmarks, our method significantly reduces hallucination rates, achieving state-of-the-art visual faithfulness while preserving strong cross-modal reasoning capabilities.
📝 Abstract
Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder-based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally grounded framework that models the hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones -- letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new state of the art in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL.
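To make the pipeline concrete, here is a minimal sketch of the three ideas the abstract describes: a VTACR-style ratio, a threshold-triggered visual reweighting, and dual-path contrastive decoding. The function names, the threshold `tau`, the boost factor, and the contrastive weight `alpha` are illustrative assumptions for a single attention vector, not the paper's exact formulation (see the linked repository for the real implementation).

```python
import numpy as np

def vtacr(attn, visual_idx, text_idx):
    """Illustrative VTACR: attention mass placed on visual tokens
    divided by mass placed on text tokens. Low values indicate that
    textual priors dominate the current decoding step."""
    v = attn[visual_idx].sum()
    t = attn[text_idx].sum()
    return v / (t + 1e-8)

def reweight(attn, visual_idx, ratio, tau=0.5, boost=1.5):
    """Token-level intervention (assumed form): when VTACR falls below
    a threshold, upweight visual-token attention and renormalize."""
    out = attn.copy()
    if ratio < tau:
        out[visual_idx] *= boost
        out /= out.sum()
    return out

def contrastive_decode(logits_grounded, logits_halluc, alpha=1.0):
    """Dual-path contrastive decoding (assumed form): amplify the
    visually grounded path and subtract the hallucination-amplified
    path, so grounded tokens rise and hallucinated ones collapse."""
    return (1 + alpha) * logits_grounded - alpha * logits_halluc

# Toy step: 3 visual tokens followed by 3 text tokens.
attn = np.array([0.05, 0.05, 0.05, 0.30, 0.30, 0.25])
vis, txt = np.arange(3), np.arange(3, 6)
r = vtacr(attn, vis, txt)       # low ratio: text dominates
attn2 = reweight(attn, vis, r)  # visual mass boosted, renormalized
```

In this toy step the visual tokens carry only 15% of the attention mass, so the ratio falls below the threshold and the intervention shifts mass back toward the image before the contrastive step compares the two decoding paths.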