How Semantically Informative is an Image?: Measuring the Covariance-Weighted Norm of Contrastive Learning Embeddings

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of quantifying the "absolute semantic information content" of contrastive-learning image embeddings. It proposes a cross-modal Information Gain framework that extends the classical notion of information gain from natural language processing to the vision–language joint embedding space. The core contribution defines a covariance-weighted perturbation strength of image features relative to the text embedding distribution, formalized as a norm over the mean shift and covariance change, yielding a computationally efficient, parameter-free measure of semantic richness whose per-image cost is independent of the sample size. The method operates directly on off-the-shelf CLIP or SigLIP embeddings, building on theoretical results for Skip-Gram with Negative Sampling (SGNS) word embeddings, and requires no fine-tuning or additional annotations. Evaluated on OpenCLIP, the metric reliably identifies low-information images (e.g., "image not found" placeholders) and shows near-perfect agreement between CLIP and SigLIP results (coefficient of determination 0.98–1.00), demonstrating computational efficiency, scalability, and strong cross-model consistency.

📝 Abstract
Contrastive learning has the capacity to model multimodal probability distributions by embedding and aligning visual representations with semantics from captions. This approach enables the estimation of relational semantic similarity; however, it remains unclear whether it can also represent absolute semantic informativeness. In this work, we introduce a semantic informativeness metric for an image calculated from text samples via a contrastive learning model; similarly, the informativeness of a text is calculated from image samples. We propose a redefinition of the concept of Information Gain, a concept previously explored in natural language processing, extending its application to the domains of vision and language. Our metric quantifies how conditioning on an image distorts the distribution of associated texts, and vice versa for text conditioning on image distributions. In empirical results with OpenCLIP, we observe that images with the lowest Information Gain scores often correspond to placeholder icons such as "image not found." Furthermore, we propose a norm-based metric of the embedding to estimate the Information Gain, following the theoretical results for Skip-Gram with Negative Sampling (SGNS) word embeddings. Information Gain can be measured using either CLIP or SigLIP, and the results demonstrate a strong correlation with a coefficient of determination ranging from 0.98 to 1.00. After obtaining the mean and covariance of the sample embeddings, the computational cost of this method is independent of the sample size, and it is compatible with publicly available, open-weight models.
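The abstract's notion of "how conditioning on an image distorts the distribution of associated texts" can be sketched with a simple Monte Carlo estimate: form the conditional text distribution via a CLIP-style softmax over cosine similarities and measure its KL divergence from the unconditioned (uniform) distribution over the same text pool. The softmax conditional, the temperature value, and the random vectors standing in for real CLIP/SigLIP embeddings are all illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: sample-based Information Gain of one image against a pool
# of text embeddings. Random vectors stand in for real CLIP/SigLIP
# embeddings; the softmax conditional and temperature are assumptions.
import numpy as np

def information_gain(image_emb, text_embs, temperature=0.07):
    """KL( p(text | image) || uniform ) over a pool of text samples."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature      # cosine similarities
    logits -= logits.max()                            # numerical stability
    p = np.exp(logits)
    p /= p.sum()                                      # p(text | image)
    n = len(text_embs)
    return float(np.sum(p * np.log(p * n + 1e-12)))  # KL to uniform prior

rng = np.random.default_rng(0)
texts = rng.normal(size=(1000, 512))                  # stand-in text pool
informative = 5.0 * texts[0] + rng.normal(size=512)   # aligned with one caption
generic = rng.normal(size=512)                        # aligned with nothing
ig_hi = information_gain(informative, texts)          # peaked conditional
ig_lo = information_gain(generic, texts)              # near-uniform conditional
```

An image aligned with a specific caption concentrates the conditional distribution and scores high; a placeholder-like image that relates to nothing in particular leaves the distribution near uniform and scores low, matching the paper's observation about "image not found" icons.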
Problem

Research questions and friction points this paper is trying to address.

Measure semantic informativeness of images and texts
Redefine Information Gain for vision and language
Estimate Information Gain using embedding norms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning aligns visual and text semantics
Redefines Information Gain for vision and language
Norm-based metric estimates semantic informativeness efficiently
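The efficiency claim above (cost independent of sample size once the mean and covariance are known) can be illustrated with a Mahalanobis-style covariance-weighted norm: fit the text pool's mean and covariance once, then score each image embedding against those fixed statistics. The Mahalanobis form is a stand-in assumption for the paper's covariance weighting, and the synthetic embeddings are illustrative.

```python
# Hedged sketch: covariance-weighted norm of an image embedding relative to
# a text embedding distribution. The Mahalanobis form is an assumed stand-in
# for the paper's exact weighting; per-image cost does not depend on the
# number of text samples once mu and cov_inv are precomputed.
import numpy as np

def fit_text_stats(text_embs, eps=1e-6):
    """One-off pass over the text pool: mean and inverse (regularized) covariance."""
    mu = text_embs.mean(axis=0)
    cov = np.cov(text_embs, rowvar=False) + eps * np.eye(text_embs.shape[1])
    return mu, np.linalg.inv(cov)

def covariance_weighted_norm(image_emb, mu, cov_inv):
    """Mahalanobis-style distance of one embedding from the text distribution."""
    d = image_emb - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(1)
texts = rng.normal(size=(2000, 64))                 # stand-in text embeddings
mu, cov_inv = fit_text_stats(texts)                 # computed once
typical = mu + 0.1 * rng.normal(size=64)            # near the distribution's center
outlier = mu + 10.0 * rng.normal(size=64)           # strong perturbation
score_typical = covariance_weighted_norm(typical, mu, cov_inv)
score_outlier = covariance_weighted_norm(outlier, mu, cov_inv)
```

Under this reading, an embedding that perturbs the text distribution strongly gets a large norm, while one sitting near the distribution's center gets a small one; scoring a new image is a single matrix-vector product regardless of how many text samples were used to fit `mu` and `cov_inv`.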