🤖 AI Summary
Existing CLIP-style multimodal contrastive pretraining frameworks do not explicitly exploit the inherent linear structure of pointwise mutual information (PMI) in similarity modeling, so their learned similarities deviate from the theoretically optimal metric. To address this, we propose KME-CLIP—the first framework to systematically incorporate PMI's linearity into contrastive learning. Specifically, it models the optimal cross-modal similarity directly as a kernel inner product in a reproducing kernel Hilbert space (RKHS), enabling arbitrary-precision approximation of the PMI-defined optimum without explicit PMI estimation and capturing higher-order statistical dependencies. We provide theoretical guarantees on convergence and optimality. Empirically, KME-CLIP overall outperforms standard CLIP on image–text retrieval and zero-shot classification, demonstrating both a sound theoretical foundation and consistent performance gains.
📝 Abstract
In this study, we propose an enhancement to the similarity computation mechanism in multimodal contrastive pretraining frameworks such as CLIP. Prior theoretical work has shown that the optimal similarity metric between paired modalities corresponds to the pointwise mutual information (PMI) of the two modalities. However, current implementations of CLIP and its variants do not fully exploit the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through inner products in a reproducing kernel Hilbert space. We prove that our method can approximate PMI with arbitrary accuracy, and we empirically demonstrate that it overall outperforms the standard CLIP formulation across several retrieval and classification tasks.
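The PMI-as-optimal-similarity idea above can be made concrete with a toy sketch. This is illustrative only and is not the paper's method or KME-CLIP's kernel construction: it builds a random joint distribution over a few "image"/"text" pairs, computes the PMI matrix `log p(x,y) / (p(x)p(y))`, and shows that a factorization into inner products of per-modality embeddings (the linear structure CLIP-style similarities exploit, here obtained via SVD) can reconstruct PMI exactly at full rank.

```python
import numpy as np

# Hypothetical toy joint distribution over 4 "images" x 4 "texts"
# (illustrative only; not data or code from the paper).
rng = np.random.default_rng(0)
joint = rng.random((4, 4))
joint /= joint.sum()

px = joint.sum(axis=1, keepdims=True)   # marginal over "images"
py = joint.sum(axis=0, keepdims=True)   # marginal over "texts"

# Pointwise mutual information: log p(x, y) / (p(x) p(y))
pmi = np.log(joint / (px * py))

# A CLIP-style similarity is an inner product <f(x), g(y)>; a rank-k
# factorization of the PMI matrix (here via SVD) plays that role.
U, s, Vt = np.linalg.svd(pmi)
k = 4                                    # full rank -> exact reconstruction
f = U[:, :k] * np.sqrt(s[:k])            # "image" embeddings
g = Vt[:k, :].T * np.sqrt(s[:k])         # "text" embeddings
approx = f @ g.T

print(np.allclose(approx, pmi))  # True at full rank
```

With `k` smaller than the matrix rank, the same factorization gives the best low-rank inner-product approximation to PMI in Frobenius norm, which is the sense in which finite-dimensional embeddings approximate the PMI target.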