Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs

📅 2025-11-12
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination: generating content inconsistent with their visual inputs. Existing approaches typically modulate visual or textual attention in isolation, neglecting their causal interplay. This paper proposes a causality-driven dual-path attention intervention framework. First, the authors construct a causal graph to model the hallucination generation process. Second, they define the Visual-to-Textual Attention Contribution Ratio (VTACR) to quantify cross-modal imbalance during decoding. Third, they introduce fine-grained token-level and layer-wise attention reweighting, coupled with dual-path contrastive decoding that jointly strengthens visual grounding and suppresses hallucination. Evaluated on the POPE and CHAIR benchmarks, the method significantly reduces hallucination rates, achieving state-of-the-art visual faithfulness while preserving strong cross-modal reasoning capability.

📝 Abstract
Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder-based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models the hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones -- letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new SOTA in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL
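The abstract defines VTACR as the ratio of visual to textual attention contribution at a decoding step. The paper's exact formulation is not reproduced here; a minimal sketch, assuming VTACR is the attention mass assigned to visual tokens divided by the mass assigned to textual tokens (function name and shapes are illustrative):

```python
import numpy as np

def vtacr(attn_weights: np.ndarray, visual_mask: np.ndarray) -> float:
    """Visual-to-Textual Attention Contribution Ratio (hypothetical reading).

    attn_weights: attention distribution over the input sequence at one
        decoding step, shape (seq_len,), non-negative, summing to 1.
    visual_mask: boolean mask, True where a position is a visual token.
    """
    visual_mass = attn_weights[visual_mask].sum()
    textual_mass = attn_weights[~visual_mask].sum()
    eps = 1e-8  # guard against division by zero
    return float(visual_mass / (textual_mass + eps))

# Toy example: 4 visual tokens, 4 text tokens; text dominates this step.
attn = np.array([0.05, 0.05, 0.05, 0.05, 0.2, 0.2, 0.2, 0.2])
mask = np.array([True] * 4 + [False] * 4)
low_vtacr = vtacr(attn, mask)  # well below 1: a low-VTACR, text-dominated step
```

Under this reading, the paper's observation is that hallucinations cluster at steps where this ratio is small, i.e. textual priors crowd out visual evidence.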
Problem

Research questions and friction points this paper is trying to address.

Mitigates object hallucination in vision-language models
Addresses imbalance between visual and textual attention mechanisms
Dynamically adjusts cross-modal attention using causal intervention
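The intervention described above adjusts attention dynamically when VTACR signals an imbalance. The paper's precise reweighting rule is not given in this summary; a hedged sketch, assuming a simple scheme where visual-token attention is boosted in proportion to the VTACR shortfall below a threshold `tau` and then renormalized (all parameter names here are assumptions):

```python
import numpy as np

def reweight_attention(attn: np.ndarray, visual_mask: np.ndarray,
                       vtacr_value: float, tau: float = 1.0,
                       gamma: float = 0.5) -> np.ndarray:
    """Hypothetical VTACR-guided reweighting: upweight visual tokens
    when the step is text-dominated (vtacr_value < tau)."""
    if vtacr_value >= tau:
        return attn  # balanced step: leave attention untouched
    boost = 1.0 + gamma * (tau - vtacr_value)  # grows with the imbalance
    out = attn.copy()
    out[visual_mask] *= boost
    return out / out.sum()  # renormalize to a valid distribution

attn = np.array([0.05, 0.05, 0.05, 0.05, 0.2, 0.2, 0.2, 0.2])
mask = np.array([True] * 4 + [False] * 4)
new_attn = reweight_attention(attn, mask, vtacr_value=0.25)
```

In the paper this adjustment is applied per token and per layer; the sketch shows a single step only.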
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models hallucination via structural causal graph
Introduces VTACR metric for modality imbalance
Uses dual-path contrastive decoding strategy
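The dual-path strategy contrasts a visually grounded path against a hallucination-amplified path. The paper's exact decoding rule is not reproduced here; a minimal sketch using the standard contrastive-decoding form, where the hallucinated path's logits are subtracted from the grounded path's (the scaling factor `alpha` is an assumption):

```python
import numpy as np

def dual_path_contrastive_logits(grounded: np.ndarray,
                                 hallucinated: np.ndarray,
                                 alpha: float = 1.0) -> np.ndarray:
    """Hypothetical dual-path contrastive decoding step: push the output
    toward tokens the grounded path favors and away from tokens the
    hallucination-amplified path favors."""
    return grounded + alpha * (grounded - hallucinated)

# Toy vocabulary of 3 tokens. The grounded path slightly prefers token 0,
# but the hallucinated path prefers it even more strongly, so contrasting
# the two flips the choice to token 1.
grounded = np.array([2.0, 1.9, 0.5])
hallucinated = np.array([2.5, 0.5, 0.5])
adjusted = dual_path_contrastive_logits(grounded, hallucinated)
```

The intuition matches the abstract's phrasing: tokens both paths agree on are damped, while tokens only the grounded path supports stand out.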
Liu Yu
University of Electronic Science and Technology of China
Zhonghao Chen
University of Electronic Science and Technology of China
Ping Kuang
University of Electronic Science and Technology of China
Zhikun Feng
University of Electronic Science and Technology of China
Fan Zhou
University of Electronic Science and Technology of China
Lan Wang
University of Electronic Science and Technology of China
Gillian Dobbie
University of Auckland