AI Summary
Object hallucination in large vision-language models (VLMs) severely hinders their safe deployment. Existing detection methods typically rely on either global or local representations alone, limiting robustness and generalization. To address this, we propose GLSim, a training-free object hallucination detection framework that jointly leverages global and local cross-modal embedding similarities. Built upon aligned vision-language models (e.g., CLIP), GLSim evaluates semantic consistency between image and text at both the holistic and the fine-grained regional level, overcoming the limitations of single-perspective approaches. Extensive experiments across multiple standard benchmarks demonstrate that GLSim significantly outperforms state-of-the-art methods in accuracy, robustness, and zero-shot generalization, establishing an efficient and reliable paradigm for trustworthy VLM deployment.
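The core idea above can be sketched as a simple scoring function: given a global image embedding, a set of local patch embeddings, and the text embedding of a candidate object (e.g., all from an aligned model such as CLIP), combine global and local cosine similarities into one hallucination score. Note this is a minimal illustration under assumed design choices; the weighting scheme (`alpha`) and max-pooling over patches are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def glsim_score(global_emb: np.ndarray,
                patch_embs: np.ndarray,
                text_emb: np.ndarray,
                alpha: float = 0.5) -> float:
    """Illustrative global-local similarity score (hypothetical form).

    global_emb: (d,)   global image embedding (e.g., CLIP pooled output)
    patch_embs: (n, d) local patch embeddings from the vision encoder
    text_emb:   (d,)   embedding of the candidate object word/phrase
    Returns a combined score; a low value suggests the mentioned object
    may be hallucinated (absent from the image).
    """
    g = cosine(global_emb, text_emb)                     # holistic consistency
    l = max(cosine(p, text_emb) for p in patch_embs)     # best regional match
    return alpha * g + (1 - alpha) * l                   # assumed linear fusion
```

In practice, an object mention would be flagged as hallucinated when its score falls below a validation-chosen threshold; the point of combining both terms is that a globally plausible caption can still lack any supporting local region, and vice versa.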
Abstract
Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.