Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions

📅 2024-08-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The cross-modal similarity prediction mechanism in dual-encoder models (e.g., CLIP) remains opaque, particularly regarding fine-grained interactions between image regions and text tokens. Method: We propose the first second-order feature-pair attribution method for dual encoders, grounded in differentiable second-order Taylor expansions to quantify interaction importance between image patches and text spans. Contribution/Results: Our analysis reveals that similarity predictions rely predominantly on cross-modal feature coupling—not unimodal feature contributions—and exhibit strong class dependence and out-of-distribution sensitivity. By clustering error patterns, we identify three canonical failure modes: insufficient object coverage, anomalous scenes, and contextual confusion—enabling interpretable localization of individual prediction errors. This work establishes a new paradigm for explainability in dual-encoder models and provides a reproducible analytical toolkit for rigorous, fine-grained attribution.

📝 Abstract
Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their success, however, it is not understood how these models compare their two inputs. Common first-order feature-attribution methods can provide only limited insights into dual encoders, since their predictions depend on feature interactions rather than on individual features. In this paper, we first derive a second-order method enabling the attribution of predictions by any differentiable dual encoder onto feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This visual-linguistic grounding ability, however, varies heavily between object classes and exhibits pronounced out-of-domain effects. We can identify individual errors as well as systematic failure categories, including object coverage, unusual scenes, and correlated contexts.
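To illustrate the idea of attributing a similarity score onto feature *pairs* rather than individual features, here is a minimal toy sketch. It assumes a purely bilinear similarity head s(x, y) = xᵀWy — a simplification of the paper's setting, where CLIP computes cosine similarity over deep encodings and the interaction terms come from a second-order Taylor expansion. The variables `x`, `y`, and `W` are hypothetical toy inputs, not part of the paper:

```python
import numpy as np

# Hypothetical toy sketch (NOT the paper's method): for a bilinear
# similarity s(x, y) = x^T W y, the second-order cross term decomposes
# the score exactly into feature-pair contributions
#   A[i, j] = x_i * W[i, j] * y_j,   with   sum_ij A[i, j] = s(x, y).
# In the paper, the analogous interaction terms for a deep dual encoder
# come from a second-order Taylor expansion of the similarity.

rng = np.random.default_rng(0)
d_img, d_txt = 6, 4                      # toy feature dimensions
x = rng.normal(size=d_img)               # image-side features (toy)
y = rng.normal(size=d_txt)               # text-side features (toy)
W = rng.normal(size=(d_img, d_txt))      # bilinear interaction weights (toy)

s = x @ W @ y                            # similarity score
A = np.outer(x, y) * W                   # feature-pair attribution matrix

# Completeness: the attributions sum back to the prediction.
assert np.isclose(A.sum(), s)

# Each entry A[i, j] quantifies how strongly image feature i and text
# feature j jointly push the similarity up or down.
i, j = np.unravel_index(np.abs(A).argmax(), A.shape)
print(f"score = {s:.4f}, strongest interaction: pair ({i}, {j})")
```

The key property this toy case shares with the paper's method is completeness: the pairwise attribution matrix sums to the predicted similarity, so every part of the score is localized to a specific (image feature, text feature) pair.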
Problem

Research questions and friction points this paper is trying to address.

Understanding feature-interaction in dual-encoder models
Analyzing fine-grained caption-image correspondences in CLIP
Identifying systematic errors in visual-linguistic grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Second-order method for feature-interaction attribution
Application to CLIP models for fine-grained analysis
Identification of systematic errors in object matching
Lucas Möller
Institute for Natural Language Processing, University of Stuttgart
Pascal Tilli
Institute for Natural Language Processing, University of Stuttgart
Ngoc Thang Vu
Institute for Natural Language Processing, University of Stuttgart
Sebastian Padó
Professor of Computational Linguistics, Computer Science, Stuttgart University
Natural Language Processing · Semantics · Computational Linguistics