How Semantically Informative is an Image?: Measuring the Covariance-Weighted Norm of Contrastive Learning Embeddings

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of quantifying the "absolute semantic information content" of contrastive-learning image embeddings. It proposes a cross-modal Information Gain framework that extends the classical notion of information gain from natural language processing to the vision–language joint embedding space. The core contribution defines a covariance-weighted perturbation strength of image features relative to the text embedding distribution, formalized as a norm over the mean shift and covariance change, yielding a computationally efficient, parameter-free measure of semantic richness whose per-image cost is independent of the sample size. The method operates directly on off-the-shelf CLIP or SigLIP embeddings, building on theoretical results for Skip-Gram with Negative Sampling (SGNS) word embeddings, and requires no fine-tuning or additional annotations. Evaluated on OpenCLIP, the metric reliably identifies low-information images (e.g., "image not found" placeholders) and shows near-perfect agreement between CLIP and SigLIP results (coefficient of determination 0.98–1.00), demonstrating computational efficiency, scalability, and strong cross-model consistency.

📝 Abstract
Contrastive learning has the capacity to model multimodal probability distributions by embedding and aligning visual representations with semantics from captions. This approach enables the estimation of relational semantic similarity; however, it remains unclear whether it can also represent absolute semantic informativeness. In this work, we introduce a semantic informativeness metric for an image calculated from text samples via a contrastive learning model; similarly, the informativeness of a text is calculated from image samples. We propose a redefinition of the concept of Information Gain, a concept previously explored in natural language processing, extending its application to the domains of vision and language. Our metric quantifies how conditioning on an image distorts the distribution of associated texts, and vice versa for text conditioning on image distributions. In empirical results with OpenCLIP, we observe that images with the lowest Information Gain scores often correspond to placeholder icons such as "image not found." Furthermore, we propose a norm-based metric of the embedding to estimate the Information Gain, following the theoretical results for Skip-Gram with Negative Sampling (SGNS) word embeddings. Information Gain can be measured using either CLIP or SigLIP, and the results demonstrate a strong correlation with a coefficient of determination ranging from 0.98 to 1.00. After obtaining the mean and covariance of the sample embeddings, the computational cost of this method is independent of the sample size, and it is compatible with publicly available, open-weight models.
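The abstract's notion of "how conditioning on an image distorts the distribution of associated texts" can be sketched with a simple Monte Carlo estimate: form the conditional text distribution via a CLIP-style softmax over cosine similarities and measure its KL divergence from the unconditioned (uniform) distribution over the same text pool. The softmax conditional, the temperature value, and the random vectors standing in for real CLIP/SigLIP embeddings are all illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: sample-based Information Gain of one image against a pool
# of text embeddings. Random vectors stand in for real CLIP/SigLIP
# embeddings; the softmax conditional and temperature are assumptions.
import numpy as np

def information_gain(image_emb, text_embs, temperature=0.07):
    """KL( p(text | image) || uniform ) over a pool of text samples."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature      # cosine similarities
    logits -= logits.max()                            # numerical stability
    p = np.exp(logits)
    p /= p.sum()                                      # p(text | image)
    n = len(text_embs)
    return float(np.sum(p * np.log(p * n + 1e-12)))  # KL to uniform prior

rng = np.random.default_rng(0)
texts = rng.normal(size=(1000, 512))                  # stand-in text pool
informative = 5.0 * texts[0] + rng.normal(size=512)   # aligned with one caption
generic = rng.normal(size=512)                        # aligned with nothing
ig_hi = information_gain(informative, texts)          # peaked conditional
ig_lo = information_gain(generic, texts)              # near-uniform conditional
```

An image aligned with a specific caption concentrates the conditional distribution and scores high; a placeholder-like image that relates to nothing in particular leaves the distribution near uniform and scores low, matching the paper's observation about "image not found" icons.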
Problem

Research questions and friction points this paper is trying to address.

Measure semantic informativeness of images and texts
Redefine Information Gain for vision and language
Estimate Information Gain using embedding norms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning aligns visual and text semantics
Redefines Information Gain for vision and language
Norm-based metric estimates semantic informativeness efficiently
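The efficiency claim above (cost independent of sample size once the mean and covariance are known) can be illustrated with a Mahalanobis-style covariance-weighted norm: fit the text pool's mean and covariance once, then score each image embedding against those fixed statistics. The Mahalanobis form is a stand-in assumption for the paper's covariance weighting, and the synthetic embeddings are illustrative.

```python
# Hedged sketch: covariance-weighted norm of an image embedding relative to
# a text embedding distribution. The Mahalanobis form is an assumed stand-in
# for the paper's exact weighting; per-image cost does not depend on the
# number of text samples once mu and cov_inv are precomputed.
import numpy as np

def fit_text_stats(text_embs, eps=1e-6):
    """One-off pass over the text pool: mean and inverse (regularized) covariance."""
    mu = text_embs.mean(axis=0)
    cov = np.cov(text_embs, rowvar=False) + eps * np.eye(text_embs.shape[1])
    return mu, np.linalg.inv(cov)

def covariance_weighted_norm(image_emb, mu, cov_inv):
    """Mahalanobis-style distance of one embedding from the text distribution."""
    d = image_emb - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(1)
texts = rng.normal(size=(2000, 64))                 # stand-in text embeddings
mu, cov_inv = fit_text_stats(texts)                 # computed once
typical = mu + 0.1 * rng.normal(size=64)            # near the distribution's center
outlier = mu + 10.0 * rng.normal(size=64)           # strong perturbation
score_typical = covariance_weighted_norm(typical, mu, cov_inv)
score_outlier = covariance_weighted_norm(outlier, mu, cov_inv)
```

Under this reading, an embedding that perturbs the text distribution strongly gets a large norm, while one sitting near the distribution's center gets a small one; scoring a new image is a single matrix-vector product regardless of how many text samples were used to fit `mu` and `cov_inv`.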