🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive object hallucination. Challenging the prevailing assumption that hallucination stems from insufficient representational capacity in the visual encoder, this work demonstrates that standard visual encoders already possess sufficient discriminative power, requiring no architectural expansion. Based on this insight, the authors propose Fine-grained CLIPScore (F-CLIPScore), a zero-shot, training-free hallucination evaluation metric aligned at the noun-phrase level. Building on the CLIP architecture, F-CLIPScore incorporates noun-phrase-level text embeddings alongside the full-caption embedding, improving fine-grained cross-modal alignment. On the OHD-Caps benchmark, F-CLIPScore achieves a 39.6% absolute accuracy gain over vanilla CLIPScore. When used to filter training data, it yields LVLMs with measurably reduced hallucination rates. The core contribution is twofold: (1) revealing the latent hallucination-detection capability of off-the-shelf visual encoders, and (2) establishing a noun-phrase-level, training-free, high-accuracy hallucination evaluation paradigm.
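The mechanism described above can be sketched in a few lines. The sketch below is a minimal illustration, not the authors' implementation: it assumes F-CLIPScore aggregates the full-caption CLIPScore with the mean of per-noun-phrase CLIPScores (the exact aggregation in the paper may differ), and `embed_text` / the image embedding are stand-in vectors rather than real CLIP features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clipscore(image_emb, text_emb, w=2.5):
    """Vanilla CLIPScore: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)

def f_clipscore(image_emb, caption, noun_phrases, embed_text, w=2.5):
    """Hypothetical F-CLIPScore-style aggregation: average the
    full-caption CLIPScore with the mean per-noun-phrase CLIPScore."""
    full = clipscore(image_emb, embed_text(caption), w)
    if not noun_phrases:
        return full
    np_scores = [clipscore(image_emb, embed_text(p), w) for p in noun_phrases]
    return 0.5 * (full + sum(np_scores) / len(np_scores))

# Toy usage with 2-D stand-in embeddings (not real CLIP features):
emb = {"a photo of a dog": [1.0, 0.0], "dog": [1.0, 0.0], "park": [0.0, 1.0]}
score = f_clipscore([1.0, 0.0], "a photo of a dog", ["dog", "park"],
                    lambda t: emb[t])
# score == 1.875: the mismatched phrase "park" drags the score down,
# which is exactly the object-level sensitivity the metric targets.
```

In practice the noun phrases would come from a parser (e.g. a chunker) and the embeddings from a frozen CLIP model; the point of the sketch is only the phrase-level aggregation.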
📝 Abstract
Recently, Large Vision-Language Models (LVLMs) have shown remarkable performance across various domains. However, these models suffer from object hallucination. This study revisits the previous claim that the primary cause of such hallucination lies in the limited representational capacity of the vision encoder. Our analysis reveals that the capacity of the vision encoder itself is already sufficient for detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun-phrase level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further validate F-CLIPScore by showing that an LVLM trained on data filtered with F-CLIPScore exhibits reduced hallucination.
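The data-filtering validation described in the abstract amounts to thresholding training pairs on their score. A minimal sketch, assuming a simple score threshold (the paper's actual filtering criterion may differ); `score_fn` stands in for F-CLIPScore, and the sample names and scores here are invented for illustration:

```python
def filter_by_score(samples, score_fn, threshold):
    """Keep (image, caption) training pairs whose score meets the
    threshold; score_fn is a stand-in for F-CLIPScore."""
    return [s for s in samples if score_fn(*s) >= threshold]

# Toy usage with precomputed stand-in scores (not real F-CLIPScore values):
toy_scores = {"cap_good": 1.8, "cap_hallucinated": 0.4}
pairs = [("img1", "cap_good"), ("img2", "cap_hallucinated")]
kept = filter_by_score(pairs, lambda img, cap: toy_scores[cap], threshold=1.0)
# kept == [("img1", "cap_good")]
```

Pairs whose captions score poorly against their images are dropped before LVLM training, which is how the metric translates into reduced downstream hallucination.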