GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

πŸ“… 2025-08-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Object hallucination in large vision-language models (LVLMs) severely hinders their safe deployment. Existing detection methods typically rely on either global or local representations alone, limiting robustness and generalization. To address this, the authors propose GLSim, a training-free object hallucination detection framework that jointly leverages global and local cross-modal embedding similarities. Built on aligned vision-language models (e.g., CLIP), GLSim evaluates semantic consistency between image and text at both the holistic and fine-grained regional level, overcoming the limitations of single-perspective approaches. Extensive experiments across standard benchmarks show that GLSim significantly outperforms state-of-the-art methods in accuracy, robustness, and zero-shot generalization, offering an efficient and reliable path toward trustworthy LVLM deployment.

πŸ“ Abstract
Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.
Problem

Research questions and friction points this paper is trying to address.

Detecting object hallucinations in vision-language models
Combining global and local similarity for reliable detection
Improving accuracy in diverse real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines global and local similarity signals
Training-free object hallucination detection framework
Leverages complementary image-text embedding similarities
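
The combination of global and local similarity signals described above can be sketched as follows. This is a minimal illustration, not the paper's exact scoring rule: it assumes precomputed embeddings from an aligned image-text model such as CLIP, and the function name `glsim_score`, the region set, and the weighting parameter `alpha` are all hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def glsim_score(global_img_emb, region_embs, text_emb, alpha=0.5):
    """Illustrative global-local similarity score for one candidate object.

    global_img_emb: embedding of the whole image
    region_embs:    embeddings of local image regions (e.g., crops or patches)
    text_emb:       embedding of the candidate object's text
    alpha:          weight between global and local signals (hypothetical)
    """
    # Global signal: holistic image vs. object text
    g = cosine(global_img_emb, text_emb)
    # Local signal: best-matching region vs. object text
    l = max(cosine(r, text_emb) for r in region_embs)
    # Weighted combination; a low score would flag a likely hallucinated object
    return alpha * g + (1 - alpha) * l

# Demo with random vectors standing in for CLIP features
rng = np.random.default_rng(0)
text = rng.normal(size=512)
image = rng.normal(size=512)
regions = [rng.normal(size=512) for _ in range(4)]
score = glsim_score(image, regions, text)
```

Since both component similarities are cosines, the combined score stays in [-1, 1]; in practice it would be thresholded to decide whether the object mentioned in the generated text is actually grounded in the image.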
πŸ”Ž Similar Papers
No similar papers found.
Seongheon Park
University of Wisconsin-Madison
Machine Learning · Reliable AI
Yixuan Li
Department of Computer Sciences, University of Wisconsin-Madison