Is CLIP ideal? No. Can we fix it? Yes!

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP’s joint embedding space suffers from inherent geometric limitations that prevent it from simultaneously modeling descriptive content, attribute binding, spatial relations, and negation between vision and language. This work gives the first theoretical characterization of this expressivity bottleneck in contrastive models. To address it, the authors propose Dense Cosine Similarity Maps (DCSMs), an interpretable, plug-and-play scoring method that requires no backbone retraining. DCSMs compute dense cosine similarities between image patches and text tokens, explicitly preserving fine-grained semantic topology in the cross-modal alignment. Across multiple cross-modal benchmarks, DCSMs consistently outperform CLIP and its variants, especially on fine-grained visual reasoning, spatial relation understanding, and negation. The approach offers both theoretical insight and a practical, lightweight tool for semantic grounding without architectural or training overhead.

📝 Abstract
Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP
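The core operation the abstract describes, a dense map of cosine similarities between image patches and text tokens, can be sketched as follows. This is a minimal illustration, not the paper's implementation (see the linked repository for that): the function name, the NumPy formulation, and the toy shapes are assumptions, and whatever scoring the paper applies on top of the map is not reproduced here.

```python
import numpy as np

def dense_cosine_similarity_map(patch_embs, token_embs):
    """Dense cosine similarities between image patches and text tokens.

    patch_embs: (num_patches, dim) image patch embeddings
    token_embs: (num_tokens, dim) text token embeddings
    Returns a (num_patches, num_tokens) map whose entry (i, j) is the
    cosine similarity between patch i and token j, so the spatial and
    token-level structure of both modalities is kept rather than
    collapsed into a single pooled vector.
    """
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    return p @ t.T

# Toy example with random embeddings (shapes are illustrative only).
rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 512))  # e.g. a 7x7 patch grid
tokens = rng.standard_normal((8, 512))    # e.g. an 8-token caption
dcsm = dense_cosine_similarity_map(patches, tokens)
assert dcsm.shape == (49, 8)
assert np.all(np.abs(dcsm) <= 1.0 + 1e-9)  # cosines lie in [-1, 1]
```

Unlike CLIP's single image-text cosine score, the map preserves which patch aligns with which token, which is what lets a downstream scorer reason about binding, spatial layout, and negation.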
Problem

Research questions and friction points this paper is trying to address.

CLIP's latent space is known to fail at complex visual-textual interactions such as attribute binding, spatial relations, and negation.
Are these failures merely data- or algorithm-driven, or do they stem from the fundamental geometry of CLIP's joint embedding space?
Can a scoring method retain the semantic topology of image patches and text tokens without retraining the backbone?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proves that no CLIP-like joint embedding space can simultaneously represent any two of: basic descriptions, attribute binding, spatial relations, and negation
Introduces Dense Cosine Similarity Maps (DCSMs), a principled and interpretable scoring method requiring no backbone retraining
Improves CLIP-like models' performance on a wide array of benchmarks