AI Summary
Current hallucination detection methods for large vision-language models predominantly rely on global image-level scoring, which struggles to capture fine-grained errors. This work proposes a patch-level, fine-grained token grounding analysis framework that identifies hallucinated generations by modeling the interactions between tokens and localized visual regions across the model's layers. We observe that hallucinated tokens typically exhibit diffuse attention patterns and lack semantic alignment with their corresponding image patches. Leveraging this insight, we introduce fine-grained structural analysis into hallucination detection for the first time. Our approach combines patch-level statistical features with hidden-layer representations to enable lightweight, interpretable, token-level detection. It achieves up to 90% accuracy across multiple benchmarks, substantially outperforming existing global strategies.
Abstract
Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-to-patch interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the value of fine-grained structural analysis for this task.
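The "diffuse vs. compact attention" signature described above can be quantified in several ways; one natural choice (an illustrative assumption, not necessarily the paper's exact feature) is the Shannon entropy of a token's attention distribution over image patches: a faithful token concentrated on one region yields low entropy, while a hallucinated token spread thinly across many patches yields high entropy. A minimal sketch, assuming the attention over patches is available as a non-negative vector:

```python
import numpy as np

def attention_entropy(patch_attn: np.ndarray) -> float:
    """Shannon entropy (nats) of one token's attention over image patches.

    High entropy -> diffuse, non-localized attention (the hallucination
    signature described in the abstract); low entropy -> compact grounding.
    `patch_attn` is any non-negative score vector, one entry per patch;
    the exact source of these scores is an assumption here.
    """
    p = patch_attn / patch_attn.sum()   # normalize to a probability distribution
    p = np.clip(p, 1e-12, None)         # guard against log(0)
    return float(-(p * np.log(p)).sum())

# Toy illustration over 16 patches: a focused token vs. a diffuse one.
focused = np.zeros(16)
focused[5], focused[6] = 0.9, 0.1       # mass concentrated on two patches
diffuse = np.ones(16)                   # mass spread uniformly

print(attention_entropy(focused) < attention_entropy(diffuse))  # True
```

In a full detector, such per-token, per-layer entropy values would be just one of the patch-level statistical features fed (alongside hidden-layer representations) to a lightweight classifier.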