AI Summary
Current hallucination detection methods for large vision-language models predominantly rely on global image-level scoring, which struggles to capture fine-grained errors. This work proposes a patch-level, fine-grained token grounding analysis framework that identifies hallucinated generations by modeling the interactions between tokens and localized visual regions across the model's layers. We observe that hallucinated tokens typically exhibit diffuse attention patterns and lack semantic alignment with their corresponding image patches. Leveraging this insight, we introduce fine-grained structural analysis into hallucination detection for the first time. Our approach combines patch-level statistical features with hidden-layer representations to enable lightweight, interpretable, token-level detection. It achieves up to 90% accuracy across multiple benchmarks, substantially outperforming existing global strategies.
Abstract
Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-to-patch interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the value of fine-grained structural analysis for this task.
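The "diffuse vs. compact attention" signature described above can be quantified in several ways; one natural choice (an illustrative assumption, not necessarily the paper's exact feature) is the Shannon entropy of a token's attention distribution over image patches: a faithful token concentrated on one region yields low entropy, while a hallucinated token spread thinly across many patches yields high entropy. A minimal sketch, assuming the attention over patches is available as a non-negative vector:

```python
import numpy as np

def attention_entropy(patch_attn: np.ndarray) -> float:
    """Shannon entropy (nats) of one token's attention over image patches.

    High entropy -> diffuse, non-localized attention (the hallucination
    signature described in the abstract); low entropy -> compact grounding.
    `patch_attn` is any non-negative score vector, one entry per patch;
    the exact source of these scores is an assumption here.
    """
    p = patch_attn / patch_attn.sum()   # normalize to a probability distribution
    p = np.clip(p, 1e-12, None)         # guard against log(0)
    return float(-(p * np.log(p)).sum())

# Toy illustration over 16 patches: a focused token vs. a diffuse one.
focused = np.zeros(16)
focused[5], focused[6] = 0.9, 0.1       # mass concentrated on two patches
diffuse = np.ones(16)                   # mass spread uniformly

print(attention_entropy(focused) < attention_entropy(diffuse))  # True
```

In a full detector, such per-token, per-layer entropy values would be just one of the patch-level statistical features fed (alongside hidden-layer representations) to a lightweight classifier.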