🤖 AI Summary
This work addresses the challenge of hallucination in vision-language models (VLMs), which existing detection methods tackle only after text generation, incurring latency and high computational cost. We propose an early hallucination-risk prediction framework that probes internal model representations in a single forward pass, before any token is generated. To our knowledge, this is the first demonstration that such risk can be detected effectively prior to generation. Our analysis reveals that the optimal layer and modality for probing vary across VLM architectures. Using three types of representations (pure visual features, visual tokens, and multimodal fused query tokens), we train lightweight probes across eight prominent VLMs. Our approach achieves up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B, and 0.79 AUROC on Qwen2.5-VL-7B using pure visual features alone.
📝 Abstract
Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes can enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
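The probing recipe described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the pre-generation hidden states of a real VLM are replaced here by synthetic feature vectors, and the lightweight probe is assumed to be a plain logistic-regression classifier scored with AUROC.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.5, epochs=300):
    """Logistic-regression probe trained on frozen representations."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid risk scores
        g = p - y                               # gradient of the log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def auroc(scores, labels):
    """AUROC = P(score of a positive > score of a negative); ties count 1/2."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Synthetic stand-in for pre-generation hidden states: examples that will
# hallucinate (label 1) are shifted along a fixed direction in feature space.
rng = np.random.default_rng(0)
n, d = 600, 32
labels = rng.integers(0, 2, n)
direction = rng.normal(size=d)
feats = rng.normal(size=(n, d)) + 0.5 * labels[:, None] * direction

# Hold out half the data so the probe is scored on unseen examples.
X_tr, y_tr = feats[:n // 2], labels[:n // 2]
X_te, y_te = feats[n // 2:], labels[n // 2:]
w, b = train_linear_probe(X_tr, y_tr.astype(float))
test_auc = auroc(X_te @ w + b, y_te)
print(f"held-out AUROC: {test_auc:.3f}")
```

In the actual setting, `feats` would be hidden states read off a chosen layer and token position (visual-only, vision-token, or query-token) during the single forward pass, before any decoding; the probe's score could then gate abstention or routing.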