Is the Modality Gap a Bug or a Feature? A Robustness Perspective

📅 2026-03-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the origin of the modality gap between image and text embeddings in the shared representation space of multimodal models, and its impact on robustness. Theoretical analysis shows that, under certain conditions, minimizing the contrastive loss induces a global gap vector, orthogonal to the embeddings of both modalities, that separates them. Building on this insight, the authors propose a simple post-processing strategy: translating the embeddings of one modality towards the mean of the other, which substantially reduces the gap. Experiments demonstrate that this approach consistently improves robustness against embedding perturbations across multiple vision-language models without sacrificing accuracy on clean samples, establishing a monotonic relationship between the modality gap and robustness under the stated conditions.
📝 Abstract
Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
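The post-processing step described in the abstract can be sketched in a few lines: estimate the gap vector as the difference between the two modality means, shift one modality by it, and re-normalize. The snippet below is a minimal illustration on synthetic data, not the authors' implementation; the embeddings, dimensions, and the offset used to simulate the gap are all hypothetical stand-ins for real VLM features (e.g. CLIP).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 200

# Toy stand-ins for L2-normalized image/text embeddings. The fixed offset
# added to the image features simulates a modality gap; in practice these
# would be features extracted from a real vision-language model.
offset = np.zeros(d)
offset[0] = 8.0
image_emb = rng.normal(size=(n, d)) + offset
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb = rng.normal(size=(n, d))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Global gap vector: the difference between the two modality means.
gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)

# Post-processing step: move one modality towards the mean of the other,
# then re-normalize back onto the unit sphere (as cosine similarity assumes).
text_shifted = text_emb + gap
text_shifted /= np.linalg.norm(text_shifted, axis=1, keepdims=True)

before = np.linalg.norm(gap)
after = np.linalg.norm(image_emb.mean(axis=0) - text_shifted.mean(axis=0))
print(f"gap before: {before:.3f}, after: {after:.3f}")
```

Because the shift is a rigid translation applied uniformly to one modality, relative distances within that modality (and hence nearest-neighbor retrieval on clean samples) are essentially preserved, which is consistent with the paper's claim that clean accuracy is unaffected.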
Problem

Research questions and friction points this paper is trying to address.

modality gap
multi-modal models
embedding space
robustness
contrastive loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality gap
contrastive learning
embedding space
robustness
post-processing
🔎 Similar Papers
No similar papers found.
Rhea Chowers
Hebrew University
Oshri Naparstek
IBM Research
Udi Barzelay
IBM Research
Yair Weiss
Professor of Computer Science, Hebrew University
Machine Learning · Computer Vision · Human Vision