AI Summary
Object hallucination in large vision-language models (VLMs) severely hinders their safe deployment. Existing detection methods typically rely on either global or local representations alone, limiting robustness and generalization. To address this, we propose GLSim, a training-free object hallucination detection framework that jointly leverages global and local cross-modal embedding similarities. Built upon aligned vision-language models (e.g., CLIP), GLSim evaluates semantic consistency between image and text at both the holistic and the fine-grained regional level, overcoming the limitations of single-perspective approaches. Extensive experiments across multiple standard benchmarks demonstrate that GLSim significantly outperforms state-of-the-art methods in accuracy, robustness, and zero-shot generalization, establishing an efficient and reliable paradigm for trustworthy VLM deployment.
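The core idea above can be sketched as a simple scoring function: given a global image embedding, a set of local patch embeddings, and the text embedding of a candidate object (e.g., all from an aligned model such as CLIP), combine global and local cosine similarities into one hallucination score. Note this is a minimal illustration under assumed design choices; the weighting scheme (`alpha`) and max-pooling over patches are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def glsim_score(global_emb: np.ndarray,
                patch_embs: np.ndarray,
                text_emb: np.ndarray,
                alpha: float = 0.5) -> float:
    """Illustrative global-local similarity score (hypothetical form).

    global_emb: (d,)   global image embedding (e.g., CLIP pooled output)
    patch_embs: (n, d) local patch embeddings from the vision encoder
    text_emb:   (d,)   embedding of the candidate object word/phrase
    Returns a combined score; a low value suggests the mentioned object
    may be hallucinated (absent from the image).
    """
    g = cosine(global_emb, text_emb)                     # holistic consistency
    l = max(cosine(p, text_emb) for p in patch_embs)     # best regional match
    return alpha * g + (1 - alpha) * l                   # assumed linear fusion
```

In practice, an object mention would be flagged as hallucinated when its score falls below a validation-chosen threshold; the point of combining both terms is that a globally plausible caption can still lack any supporting local region, and vice versa.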
Abstract
Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.