🤖 AI Summary
Existing vision-language models suffer from coarse-grained image-text alignment, leading to fine-grained misalignments that hinder downstream tasks. Current detection methods rely heavily on large-model fine-tuning or labor-intensive manual annotation, resulting in poor efficiency and limited generalization. To address this, we propose a zero-shot dense misalignment detection framework. Our method introduces the first gradient-based, word-level misalignment attribution mechanism, requiring neither fine-tuning nor annotations, and designs F-CLIPScore, a differentiable metric that integrates local misalignment signals with a global alignment score. It supports precise localization of misalignments at the entity, abstract-concept, and attribute levels. Built on pretrained CLIP, our approach enables gradient-based attribution analysis, feature-space disentanglement, and differentiable score aggregation. Evaluated across diverse benchmarks, it achieves zero-shot state-of-the-art performance, significantly outperforms fine-tuned baselines in inference efficiency, and accurately identifies object-, concept-, and attribute-level misalignments in qualitative experiments.
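As a rough illustration of the word-level attribution idea, the sketch below computes a Grad×Input score per text token against an image embedding, with negative scores flagging candidate misaligned tokens. It uses mean pooling as a stand-in for CLIP's text encoder and a closed-form cosine-similarity gradient; both are simplifying assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def token_attributions(image_emb: np.ndarray, token_embs: np.ndarray) -> np.ndarray:
    """Grad x Input attribution per text token.

    Assumptions (not the paper's actual pipeline):
    - the text embedding is the mean of the token embeddings,
    - alignment is plain cosine similarity, differentiated analytically.
    """
    t = token_embs.mean(axis=0)                # pooled text embedding
    i = image_emb
    ni, nt = np.linalg.norm(i), np.linalg.norm(t)
    s = float(i @ t) / (ni * nt)               # global cosine alignment
    # analytic gradient of cosine similarity w.r.t. the pooled embedding t
    grad_t = i / (ni * nt) - s * t / (nt ** 2)
    # mean pooling distributes the gradient equally over tokens
    grad_tok = grad_t / len(token_embs)
    # Grad x Input: a negative value marks a candidate misaligned token
    return (token_embs * grad_tok).sum(axis=1)

# toy example: token 0 matches the image direction, token 1 does not
image = np.array([1.0, 0.0, 0.0, 0.0])
tokens = np.array([[1.0, 0.0, 0.0, 0.0],      # aligned token
                   [0.0, 1.0, 0.0, 0.0]])     # unrelated token
attr = token_attributions(image, tokens)
```

In this toy setup the aligned token receives a positive attribution and the unrelated one a negative attribution, mirroring how negative gradients are used as a misalignment signal.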
📝 Abstract
Recent vision-language foundation models still frequently produce outputs misaligned with their inputs, as evidenced by object hallucination in captioning and prompt misalignment in text-to-image generation models. Recent studies have explored methods for identifying misaligned elements, aiming not only to enhance interpretability but also to improve model performance. However, current approaches primarily rely on large foundation models used in a zero-shot manner or on models fine-tuned with human annotations, which limits scalability due to significant computational costs. This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments with pre-trained CLIP, specifically focusing on pinpointing misaligned words between image and text. We carefully revamp the gradient-based attribution computation so that the negative gradient of an individual text token indicates misalignment. We also propose F-CLIPScore, which aggregates misaligned attributions with a global alignment score. We evaluate our method on various dense misalignment detection benchmarks spanning diverse image and text domains and misalignment types. Our method achieves state-of-the-art performance among zero-shot models and competitive performance against fine-tuned models while maintaining superior efficiency. Our qualitative examples show that our method has a unique strength in detecting entity-level objects, intangible objects, and attributes that existing methods cannot easily detect. We conduct ablation studies and analyses to highlight the strengths and limitations of our approach. Our code is publicly available at https://github.com/naver-ai/CLIP4DM.
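The abstract states that F-CLIPScore aggregates misaligned attributions with a global alignment score; the exact formula is defined in the paper. As a hedged sketch only, one plausible aggregation penalizes the global score by a weighted sum of the negative attribution magnitudes:

```python
def f_clipscore(global_score: float, attributions, w: float = 1.0) -> float:
    """Toy aggregation sketch, NOT the paper's formula.

    Assumes: `attributions` are per-token Grad x Input scores where
    negative values indicate misaligned words, and `w` is a hypothetical
    weight trading off local penalties against the global alignment score.
    """
    neg_penalty = sum(-a for a in attributions if a < 0)
    return global_score - w * neg_penalty

# e.g. a global CLIP-style score of 0.8 with one mildly misaligned token
score = f_clipscore(0.8, [0.2, -0.1, 0.3])
```

A caption with no negative attributions keeps its global score unchanged, while each misaligned word pulls the aggregate down in proportion to its attribution magnitude.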