🤖 AI Summary
This study challenges the prevailing hypothesis that CLIP's underperformance on intra-modal tasks stems from its neglect of image–image alignment. Through theoretical analysis and comparative experiments across multiple models, including CLIP, SigLIP, and DINO, the authors show that the hypothesized degrees-of-freedom issue in the image embedding space does not exist, and that similar performance patterns appear even in models trained solely on images. The work refutes "intra-modal misalignment" as a core bottleneck, revealing instead that ambiguous task definitions, not structural flaws in the embeddings, are the primary factor limiting performance. These findings clarify longstanding misconceptions about CLIP's embedding properties and underscore the importance of well-specified task formulations for accurate model evaluation.
📝 Abstract
Recent research has suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The prevailing theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics it is said to affect. For the theoretical argument, we demonstrate that the supposed degrees of freedom in image embedding distances do not exist. For the empirical measures, our findings reveal that they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2), indicating that the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks of retrieval and few-shot classification confirm that addressing task ambiguity, not the supposed misalignment, is key to achieving the best results.
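To make the debated loss concrete: below is a minimal NumPy sketch of the symmetric InfoNCE objective used in CLIP-style contrastive training (not the authors' code). The logits matrix contains only image–text similarities, with no image–image term; this structural fact is what the intra-modal misalignment hypothesis points to, and what the study argues does not actually leave the image–image distances unconstrained.

```python
import numpy as np

def clip_loss(img, txt, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings."""
    # L2-normalize each embedding so dot products are cosine similarities.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    # Inter-modal similarities only: note there is no img @ img.T term.
    logits = img @ txt.T / temperature
    labels = np.arange(len(logits))  # matched pairs lie on the diagonal

    def xent(l):
        # Numerically stable cross-entropy with the diagonal as targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(clip_loss(img, txt))
```

Perfectly matched embeddings (`clip_loss(img, img)`) drive the loss toward zero, while the objective never directly compares two images, which is why critics attribute poorly calibrated image–image distances to it.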