Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

📅 2025-09-03

🤖 AI Summary
Large Vision-Language Models (LVLMs) often treat text inputs that have no visual evidence in the image as if they were depicted, which leads to erroneous responses. This work identifies a subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, whose distinctive activation pattern consistently signals that a textual concept is visually absent, indicating that LVLMs internally encode whether a token is grounded in the image. Building on this, the authors develop a detection module that classifies each input token as visually grounded or not, and use its predictions to refine outputs by reinterpreting the question prompt or replacing detected absent tokens during generation. Experiments across several LVLMs show that the method mitigates the models' tendency to falsely presume the visual presence of text input.

📝 Abstract
Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, we find that they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its predictions, we propose a method to refine the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models' tendency to falsely presume the visual presence of text input and that it generalizes across various LVLMs.
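
To make the idea of "a distinctive activation pattern" concrete, the following is a minimal sketch of how candidate neurons could be ranked, assuming per-token FFN activations have already been collected for tokens with and without visual evidence. The scoring rule (a standardized gap between mean activations), the function name `select_va_neurons`, and the toy data are illustrative assumptions, not the selection procedure used in the paper.

```python
from statistics import mean, pstdev


def select_va_neurons(grounded, absent, top_k):
    """Rank neurons by how strongly their activation separates the two groups.

    grounded, absent: lists of activation vectors, one list of floats per token.
    Returns the indices of the top_k neurons with the largest standardized gap.
    NOTE: hypothetical scoring rule, chosen only for illustration.
    """
    n_neurons = len(grounded[0])
    scores = []
    for j in range(n_neurons):
        g = [row[j] for row in grounded]
        a = [row[j] for row in absent]
        # Pooled spread; fall back to 1.0 when the neuron is constant.
        spread = pstdev(g + a) or 1.0
        scores.append(abs(mean(a) - mean(g)) / spread)
    return sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)[:top_k]


# Toy data (made up): neuron 2 fires strongly only for visually absent tokens.
grounded = [[0.1, 0.5, 0.0], [0.2, 0.4, 0.1], [0.1, 0.6, 0.0]]
absent = [[0.1, 0.5, 0.9], [0.2, 0.5, 1.0], [0.1, 0.4, 0.8]]
selected = select_va_neurons(grounded, absent, top_k=1)
```

In a real model the activation vectors would come from the FFN layers of the LVLM rather than hand-written lists; how those activations are gathered and labeled is left open here.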
Problem

Research questions and friction points this paper is trying to address.

Detecting when text tokens lack visual evidence in images
Identifying specific neurons signaling visual absence in LVLMs
Mitigating erroneous responses from visually ungrounded text inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identified Visual Absence-aware FFN neurons
Developed module to classify token grounding
Refined outputs by reinterpreting prompts or replacing detected absent tokens during generation
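
The detection and replacement steps listed above can be sketched as follows. This is only a toy illustration under stated assumptions: the detector is taken to be a logistic score over VA-neuron activations, and the weights, bias, threshold, and the `[absent]` placeholder are all hypothetical values, not details taken from the paper.

```python
import math


def absence_probability(activations, weights, bias):
    """Logistic score over VA-neuron activations (assumed detector form)."""
    z = sum(w * a for w, a in zip(weights, activations)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def refine_tokens(tokens, va_activations, weights, bias,
                  threshold=0.5, replacement="[absent]"):
    """Flag tokens predicted to be visually absent and swap them out.

    Returns (refined_tokens, flags), where flags[i] is True when token i
    was classified as lacking visual evidence.
    """
    refined, flags = [], []
    for tok, act in zip(tokens, va_activations):
        is_absent = absence_probability(act, weights, bias) >= threshold
        flags.append(is_absent)
        refined.append(replacement if is_absent else tok)
    return refined, flags


# Toy example: one VA neuron per token; only "skateboard" activates it strongly.
tokens = ["a", "dog", "on", "a", "skateboard"]
va_activations = [[0.0], [0.1], [0.0], [0.0], [0.9]]
refined, flags = refine_tokens(tokens, va_activations, weights=[8.0], bias=-4.0)
```

The same flags could instead trigger a rewritten question prompt, which is the other refinement route the abstract mentions; the placeholder substitution here simply stands in for whatever replacement the method actually performs during generation.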