IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of CLIP models: despite their strong cross-modal performance, their unimodal encoders suffer from insufficient intra-modal alignment on tasks such as image-to-image retrieval. We identify that this issue stems from the coupling of cross-modal alignment and intra-modal normalization within CLIP's projection head. Through spectral analysis, we reveal a shared, approximately isotropic subspace across modalities. Leveraging this insight, we propose a training-free method that decomposes the cosine similarity and analytically processes the projector weights to decouple and remove the anisotropic, modality-specific directions, thereby isolating a well-aligned subspace. Our approach consistently improves performance on multiple intra-modal retrieval and classification benchmarks, reduces inference latency, and generalizes effectively across a range of pre-trained CLIP-style models.

📝 Abstract
Vision-Language Models like CLIP are extensively used for inter-modal tasks that involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper, we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be obtained directly from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.
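The decomposition described in the abstract can be sketched in a few lines of linear algebra. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the inter-modal operator is the product of the two projector weight matrices (as suggested by expanding the cosine similarity of projected features), uses an SVD to find its spectrum, and treats the leading group of near-equal singular values as the approximately isotropic, well-aligned subspace. The function name, the `rel_tol` threshold, and the exact operator form are assumptions for illustration.

```python
import numpy as np

def aligned_subspace(W_img, W_txt, rel_tol=0.25):
    """Hypothetical sketch of a training-free projector decomposition.

    W_img: (d_shared, p) image projector, W_txt: (d_shared, q) text projector.
    Returns orthonormal bases (U_k, V_k) for the approximately isotropic
    subspace in each pre-projection space, plus the kept singular values.
    """
    # Cosine similarity of projected features u, v expands around
    # u.T @ W_img.T @ W_txt @ v, so take M = W_img.T @ W_txt as the
    # inter-modal operator coupling the two pre-projection spaces.
    M = W_img.T @ W_txt
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep the leading group of near-equal singular values (the
    # approximately isotropic directions); discard the anisotropic tail.
    keep = s >= (1.0 - rel_tol) * s[0]
    return U[:, keep], Vt[keep].T, s[keep]
```

Intra-modal comparisons would then happen inside the aligned subspace, e.g. comparing two image embeddings `u1`, `u2` via the cosine of `U_k.T @ u1` and `U_k.T @ u2`; since the kept subspace is lower-dimensional than the full projection, this also reduces per-query compute, consistent with the latency reduction the abstract reports.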
Problem

Research questions and friction points this paper is trying to address.

intra-modal alignment
CLIP
vision-language models
projectors
misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

intra-modal alignment
CLIP projector decomposition
isotropic subspace
training-free adaptation
vision-language models