🤖 AI Summary
This work addresses the lack of efficient and interpretable methods for image–text cross-modal likelihood estimation. We propose a training-free, invertible whitening transformation for CLIP embeddings, which linearly maps raw CLIP latent-space embeddings to a standard normal distribution—thereby reducing log-likelihood estimation to an analytical function of Euclidean distance in the whitened space. Our key contribution is the first identity-whitening of CLIP embedding covariances without fine-tuning or additional parameters, enabling rigorous statistical interpretation of implicit likelihood while preserving the frozen CLIP model. Experiments confirm that whitened embeddings closely follow a standard normal distribution (Kolmogorov–Smirnov test, *p* > 0.05), enabling millisecond-scale likelihood computation for both images and text. The method demonstrates strong discriminability and generalization potential in cross-modal retrieval and anomaly detection tasks.
📝 Abstract
Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce *Whitened CLIP*, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with any other feature, resulting in an identity covariance matrix. We show that the statistics of the whitened embeddings are well approximated by a standard normal distribution; thus, the log-likelihood is estimated simply by the squared Euclidean norm in the whitened embedding space. The whitening procedure is entirely training-free and performed using a pre-computed whitening matrix, and is therefore very fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions.
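The whitening-then-scoring pipeline described above can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's implementation: the random `embeddings` array is a placeholder for actual CLIP latent vectors, the ZCA form of the whitening matrix is one standard choice of invertible linear whitener, and the Gaussian constant in `log_likelihood` is included only for completeness.

```python
import numpy as np

# Placeholder for precomputed CLIP embeddings (n_samples x d); random data
# with an induced correlation structure stands in for real latents here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64)) @ rng.normal(size=(64, 64))

# Pre-compute the whitening transform once from a reference embedding set.
mean = embeddings.mean(axis=0)
cov = np.cov(embeddings - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
# ZCA whitening matrix W = U diag(lambda^-1/2) U^T; invertible since cov
# is full rank, so the original embedding is recoverable via W^-1.
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

def whiten(x):
    """Map embeddings to zero mean, unit variance, identity covariance."""
    return (x - mean) @ W

def log_likelihood(x):
    """Under the standard-normal assumption on whitened embeddings,
    log p(x) = -0.5 * ||whiten(x)||^2 - (d/2) * log(2*pi)."""
    z = whiten(x)
    d = z.shape[-1]
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

# Sanity check: the whitened covariance is (numerically) the identity.
z = whiten(embeddings)
print(np.allclose(np.cov(z, rowvar=False), np.eye(64), atol=1e-6))
```

At inference time only `whiten` and a squared-norm computation are needed, which is why likelihood scoring reduces to a single matrix multiply and a dot product per embedding.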