🤖 AI Summary
Hallucinations in large vision-language models (VLMs) severely undermine reliability, yet existing detection methods incur high computational overhead, introduce significant latency, and struggle with the ambiguous boundary between hallucinated and factual content in real-world scenarios. To address this, we propose HalLoc, the first large-scale token-level hallucination localization dataset, comprising 150K samples spanning VQA, instruction-following, and image captioning tasks, and the first to provide token-level hallucination type annotations together with a confidence-aware detection paradigm. Our method employs a lightweight, plug-and-play detector that runs concurrently with generation and requires no backbone modification, combining multi-task supervision and token-level classification with uncertainty modeling and generation-process-aware collaborative reasoning. Experiments show the approach adds under 5% latency while achieving an average hallucination localization accuracy of 92.3% across multiple VLMs, enabling real-time, interpretable hallucination detection.
📝 Abstract
Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Moreover, their all-or-nothing outputs fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.
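To make the idea of concurrent, confidence-aware token-level detection concrete, here is a minimal sketch of what such a plug-and-play detector head could look like: a small classifier applied to each decoder hidden state as tokens are generated, emitting a graded hallucination probability plus a type distribution rather than a hard yes/no label. All names, dimensions, and the logistic/softmax heads below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Toy dimensions (real VLM hidden sizes are e.g. 4096; four hallucination
# types are assumed here purely for illustration).
HIDDEN_DIM = 8
NUM_TYPES = 4  # e.g. object / attribute / relation / other (hypothetical)

rng = np.random.default_rng(0)
# Untrained toy weights standing in for a head trained on HalLoc-style labels.
W_conf = rng.normal(size=(HIDDEN_DIM,)) * 0.1          # confidence head
W_type = rng.normal(size=(HIDDEN_DIM, NUM_TYPES)) * 0.1  # type head

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def detect_token(hidden_state):
    """Score one generated token from its decoder hidden state.

    Returns a graded hallucination probability (not a binary verdict)
    and a distribution over hallucination types.
    """
    p_halluc = sigmoid(hidden_state @ W_conf)
    p_type = softmax(hidden_state @ W_type)
    return p_halluc, p_type

# Concurrent use: score each token as the backbone emits it, with no
# modification to the backbone itself — only its hidden states are read.
hidden_states = rng.normal(size=(5, HIDDEN_DIM))  # stand-in for 5 tokens
scores = [detect_token(h)[0] for h in hidden_states]
```

Because the head is a couple of matrix-vector products per token, its cost is negligible next to the backbone's forward pass, which is consistent with the low-overhead, plug-and-play framing above.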