🤖 AI Summary
This work addresses object hallucination in vision-language models (VLMs), where a model describes objects that are not present in the image, which undermines its reliability. The authors propose Dual-Pathway Circuit Analysis, a framework that applies activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. Using Conditional Pathway Analysis (CPA), they find a consistent polarity flip in the grounding components: these components support the ground truth on correct samples but align with the hallucinated answer on erroneous ones. Targeted suppression of hallucination-pathway components reduces object hallucination by up to 76% with minimal accuracy cost. The identified circuits are consistent across architectures and transfer to relational hallucination but not to attribute hallucination.
📝 Abstract
Vision-language models (VLMs) have demonstrated remarkable capabilities in bridging visual perception and natural language understanding, enabling a wide range of multimodal reasoning tasks. However, they often produce object hallucinations, describing content absent from the input image, which limits their reliability and interpretability. To address this limitation, we propose Dual-Pathway Circuit Analysis, a framework that identifies and characterizes hallucination-related circuits in VLMs for mechanistic understanding and causal probing. We first apply activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. We then introduce Conditional Pathway Analysis (CPA) to characterize pathway-level interactions, revealing that grounding components remain strongly redundant in both correct and hallucinating samples but undergo a consistent polarity flip, shifting from supporting the ground truth on correct samples to aligning with the hallucinated answer on erroneous ones. We further perform targeted suppression of hallucination-pathway components, showing that scaling these components reduces object hallucination by up to 76% with minimal accuracy cost, and validate that the same circuit selectively transfers to relational but not attribute hallucination. Evaluations on POPE-adversarial and AMBER show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types.
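To make the two interventions named in the abstract concrete, here is a minimal toy sketch of activation patching and component scaling. It is not the authors' implementation. It assumes a hypothetical linear model in which each "component" writes into a shared residual stream and the output is a single logit difference (for example, "yes" minus "no" on a POPE-style question). All names (`component_outputs`, `patching_effects`, `suppress`, the scale factor `alpha`) are illustrative assumptions and do not appear in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_COMPONENTS, D_IN, D_MODEL = 4, 8, 6

# Hypothetical toy model: each component is a linear map whose output is
# summed into a residual stream and read out as one logit difference.
W = rng.normal(size=(N_COMPONENTS, D_IN, D_MODEL))
readout = rng.normal(size=D_MODEL)


def component_outputs(x):
    """Per-component activations, shape (N_COMPONENTS, D_MODEL)."""
    return np.einsum("i,cid->cd", x, W)


def logit_diff(acts, scale=None):
    """Sum the (optionally scaled) component outputs and apply the readout."""
    if scale is None:
        scale = np.ones(len(acts))
    return float((scale[:, None] * acts).sum(axis=0) @ readout)


def patching_effects(x_clean, x_corrupt):
    """Activation patching: copy one clean activation at a time into the
    corrupted run and record how much the logit difference moves."""
    clean = component_outputs(x_clean)
    corrupt = component_outputs(x_corrupt)
    base = logit_diff(corrupt)
    effects = []
    for c in range(N_COMPONENTS):
        patched = corrupt.copy()
        patched[c] = clean[c]
        effects.append(logit_diff(patched) - base)
    return np.array(effects)


def suppress(acts, components, alpha):
    """Scale the chosen components by alpha (assumed to lie in [0, 1]) and
    return the resulting logit difference."""
    scale = np.ones(len(acts))
    scale[list(components)] = alpha
    return logit_diff(acts, scale)


# Stand-ins for a correctly answered input and a hallucinating input.
x_clean = rng.normal(size=D_IN)
x_corrupt = rng.normal(size=D_IN)

effects = patching_effects(x_clean, x_corrupt)
top = int(np.argmax(np.abs(effects)))  # component with the largest causal effect
suppressed = suppress(component_outputs(x_corrupt), [top], alpha=0.5)
print("patching effects:", effects)
print("logit diff after scaling component", top, "by 0.5:", suppressed)
```

In this linear toy the per-component patching effects sum exactly to the gap between the clean and corrupted logit differences. In a real VLM they generally would not, because of nonlinearities and interactions between components, and the patching would be done on cached activations inside the network rather than on explicit per-component arrays.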