🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination: generating content inconsistent with visual inputs. Existing approaches typically modulate visual or textual attention in isolation, neglecting their causal interplay. This paper proposes a causality-driven dual-path attention intervention framework. First, we construct a causal graph to model the hallucination generation process. Second, we define the Visual-to-Textual Attention Contribution Ratio (VTACR) to quantify cross-modal imbalance. Third, we introduce fine-grained token-level and layer-wise attention reweighting, coupled with dual-path contrastive decoding that jointly optimizes visual grounding and hallucination suppression. Evaluated on the POPE and CHAIR benchmarks, our method significantly reduces hallucination rates, achieving state-of-the-art visual faithfulness while preserving strong cross-modal reasoning capabilities.
📝 Abstract
Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder-based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally grounded framework that models the hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones -- letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new state of the art in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL.
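To make the pipeline concrete, here is a minimal sketch of the three ideas the abstract describes: a VTACR-style ratio, a threshold-triggered visual reweighting, and dual-path contrastive decoding. The function names, the threshold `tau`, the boost factor, and the contrastive weight `alpha` are illustrative assumptions for a single attention vector, not the paper's exact formulation (see the linked repository for the real implementation).

```python
import numpy as np

def vtacr(attn, visual_idx, text_idx):
    """Illustrative VTACR: attention mass placed on visual tokens
    divided by mass placed on text tokens. Low values indicate that
    textual priors dominate the current decoding step."""
    v = attn[visual_idx].sum()
    t = attn[text_idx].sum()
    return v / (t + 1e-8)

def reweight(attn, visual_idx, ratio, tau=0.5, boost=1.5):
    """Token-level intervention (assumed form): when VTACR falls below
    a threshold, upweight visual-token attention and renormalize."""
    out = attn.copy()
    if ratio < tau:
        out[visual_idx] *= boost
        out /= out.sum()
    return out

def contrastive_decode(logits_grounded, logits_halluc, alpha=1.0):
    """Dual-path contrastive decoding (assumed form): amplify the
    visually grounded path and subtract the hallucination-amplified
    path, so grounded tokens rise and hallucinated ones collapse."""
    return (1 + alpha) * logits_grounded - alpha * logits_halluc

# Toy step: 3 visual tokens followed by 3 text tokens.
attn = np.array([0.05, 0.05, 0.05, 0.30, 0.30, 0.25])
vis, txt = np.arange(3), np.arange(3, 6)
r = vtacr(attn, vis, txt)       # low ratio: text dominates
attn2 = reweight(attn, vis, r)  # visual mass boosted, renormalized
```

In this toy step the visual tokens carry only 15% of the attention mass, so the ratio falls below the threshold and the intervention shifts mass back toward the image before the contrastive step compares the two decoding paths.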