Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions

📅 2024-08-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The cross-modal similarity prediction mechanism in dual-encoder models (e.g., CLIP) remains opaque, particularly regarding fine-grained interactions between image regions and text tokens. Method: We propose the first second-order feature-pair attribution method for dual encoders, grounded in differentiable second-order Taylor expansions to quantify interaction importance between image patches and text spans. Contribution/Results: Our analysis reveals that similarity predictions rely predominantly on cross-modal feature coupling—not unimodal feature contributions—and exhibit strong class dependence and out-of-distribution sensitivity. By clustering error patterns, we identify three canonical failure modes: insufficient object coverage, anomalous scenes, and contextual confusion—enabling interpretable localization of individual prediction errors. This work establishes a new paradigm for explainability in dual-encoder models and provides a reproducible analytical toolkit for rigorous, fine-grained attribution.

📝 Abstract
Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their success, however, it is not understood how these models compare their two inputs. Common first-order feature-attribution methods can provide only limited insights into dual encoders, since their predictions depend on feature interactions rather than on individual features. In this paper, we first derive a second-order method enabling the attribution of predictions by any differentiable dual encoder onto feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This visual-linguistic grounding ability, however, varies heavily between object classes and exhibits pronounced out-of-domain effects. We can identify individual errors as well as systematic failure categories, including object coverage, unusual scenes, and correlated contexts.
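To illustrate the idea of attributing a similarity score onto feature *pairs* rather than individual features, here is a minimal toy sketch. It assumes a purely bilinear similarity head s(x, y) = xᵀWy — a simplification of the paper's setting, where CLIP computes cosine similarity over deep encodings and the interaction terms come from a second-order Taylor expansion. The variables `x`, `y`, and `W` are hypothetical toy inputs, not part of the paper:

```python
import numpy as np

# Hypothetical toy sketch (NOT the paper's method): for a bilinear
# similarity s(x, y) = x^T W y, the second-order cross term decomposes
# the score exactly into feature-pair contributions
#   A[i, j] = x_i * W[i, j] * y_j,   with   sum_ij A[i, j] = s(x, y).
# In the paper, the analogous interaction terms for a deep dual encoder
# come from a second-order Taylor expansion of the similarity.

rng = np.random.default_rng(0)
d_img, d_txt = 6, 4                      # toy feature dimensions
x = rng.normal(size=d_img)               # image-side features (toy)
y = rng.normal(size=d_txt)               # text-side features (toy)
W = rng.normal(size=(d_img, d_txt))      # bilinear interaction weights (toy)

s = x @ W @ y                            # similarity score
A = np.outer(x, y) * W                   # feature-pair attribution matrix

# Completeness: the attributions sum back to the prediction.
assert np.isclose(A.sum(), s)

# Each entry A[i, j] quantifies how strongly image feature i and text
# feature j jointly push the similarity up or down.
i, j = np.unravel_index(np.abs(A).argmax(), A.shape)
print(f"score = {s:.4f}, strongest interaction: pair ({i}, {j})")
```

The key property this toy case shares with the paper's method is completeness: the pairwise attribution matrix sums to the predicted similarity, so every part of the score is localized to a specific (image feature, text feature) pair.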
Problem

Research questions and friction points this paper is trying to address.

Understanding feature-interaction in dual-encoder models
Analyzing fine-grained caption-image correspondences in CLIP
Identifying systematic errors in visual-linguistic grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Second-order method for feature-interaction attribution
Application to CLIP models for fine-grained analysis
Identification of systematic errors in object matching
Lucas Möller
Institute for Natural Language Processing, University of Stuttgart
Pascal Tilli
Institute for Natural Language Processing, University of Stuttgart
Ngoc Thang Vu
Institute for Natural Language Processing, University of Stuttgart
Sebastian Padó
Professor of Computational Linguistics, Computer Science, Stuttgart University
Natural Language Processing · Semantics · Computational Linguistics