🤖 AI Summary
This work identifies an information imbalance between image and text representations as the shared root cause of two well-known phenomena in contrastive vision-language models (VLMs): the modality gap and object bias. The authors analyze embedding-space geometry, measure the entropy of the logits, and run controlled experiments that vary the amount of information shared between the modalities; to enable a clean study, they also introduce a formal definition and a corresponding measure of object bias. The analysis reveals that modality separation is driven predominantly by a small subset of embedding dimensions and that image and text representations are organized differently in the latent space. Key contributions include: (1) establishing information imbalance as the common trigger of both phenomena; (2) demonstrating an empirical connection between the modality gap and the entropy of the logit distribution; and (3) showing that closing the gap leads to downstream improvements, particularly on attribute recognition, while object bias per se does not impair recognition of non-object concepts such as attributes. These findings provide explanatory grounding and concrete directions for improving representations and task generalization in VLMs.
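The finding that a few dimensions drive the modality separation can be illustrated with a small sketch. This is our construction, not the paper's code: the embeddings below are synthetic stand-ins for encoder outputs, with the inter-modality offset planted in five dimensions by design. The gap is measured as the distance between the L2-normalized modality centroids, and the squared per-dimension centroid difference shows how concentrated that distance is.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 512

# Synthetic stand-ins for encoder outputs (NOT real CLIP embeddings):
# the inter-modality offset is planted in the first five dimensions.
img = rng.normal(size=(n, dim))
txt = rng.normal(size=(n, dim))
txt[:, :5] += 4.0
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

# Modality gap measured as the distance between the modality centroids.
delta = img.mean(axis=0) - txt.mean(axis=0)
centroid_gap = np.linalg.norm(delta)

# Squared per-dimension contribution to the centroid difference.
per_dim = delta ** 2
top = np.argsort(per_dim)[::-1][:5]
share = per_dim[top].sum() / per_dim.sum()
print(f"gap = {centroid_gap:.3f}; top-5 dims carry {share:.1%} of it")
```

In this toy setup the five planted dimensions account for nearly all of the centroid distance; on real CLIP embeddings the paper reports an analogous concentration, though the responsible dimensions must be found empirically rather than planted.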
📝 Abstract
Contrastive vision-language models (VLMs), like CLIP, have gained popularity for their versatile applicability to various downstream tasks. Despite their successes in some tasks, like zero-shot object recognition, they perform surprisingly poorly on other tasks, like attribute recognition. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes. In this analysis paper, we investigate both phenomena thoroughly. We evaluate off-the-shelf VLMs, and while the gap's influence on performance is typically overshadowed by other factors, we find indications that closing the gap indeed leads to improvements. Moreover, we find that, contrary to intuition, only a few embedding dimensions drive the gap and that the embedding spaces are differently organized. To allow for a clean study of object bias, we introduce a definition and a corresponding measure of it. Equipped with this tool, we find that object bias does not per se lead to worse performance on other concepts, such as attributes. However, why do both phenomena, modality gap and object bias, emerge in the first place? To answer this fundamental question and uncover some of the inner workings of contrastive VLMs, we conduct experiments that control the amount of shared information between the modalities. These experiments reveal that the driving factor behind both the modality gap and the object bias is an information imbalance between images and captions, and unveil an intriguing connection between the modality gap and the entropy of the logits.
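The reported connection between the modality gap and logit entropy can be made concrete with a toy construction (ours; the paper's actual experiments differ): when the two modalities are separated by a constant offset, the text embeddings share a large common component, which shrinks the spread of the image-text cosine similarities and raises the entropy of the softmax over the contrastive logits. Translating the text centroid onto the image centroid lowers it again. All embeddings and the temperature below are illustrative assumptions.

```python
import numpy as np

def mean_entropy(logits):
    """Average entropy of the row-wise softmax distribution."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(1)
n, dim = 256, 64
shared = rng.normal(size=(n, dim))  # shared content per image-text pair

# Toy paired embeddings: same content, small noise, constant text offset.
img = shared + 0.1 * rng.normal(size=(n, dim))
txt = shared + 0.1 * rng.normal(size=(n, dim)) + 2.0
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

tau = 0.07  # assumed CLIP-like softmax temperature
gap = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
ent_with_gap = mean_entropy(img @ txt.T / tau)

# Close the gap: translate text embeddings onto the image centroid.
txt_closed = txt - txt.mean(axis=0) + img.mean(axis=0)
txt_closed /= np.linalg.norm(txt_closed, axis=1, keepdims=True)
ent_closed = mean_entropy(img @ txt_closed.T / tau)
print(f"gap={gap:.2f}  entropy with gap={ent_with_gap:.2f}  "
      f"entropy closed={ent_closed:.2f}")
```

In this sketch, closing the gap sharpens the logit distribution and reduces its entropy; the direction of the effect matches the correlation the paper reports, but the mechanism here is deliberately simplified.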