🤖 AI Summary
Existing CLIPScore cannot quantify the intrinsic diversity of text-to-image (T2I) model outputs. To address this, we propose a Schur complement-based decomposition of the CLIP kernel covariance matrix (the first application of the Schur complement to CLIP embedding analysis) to disentangle text-relevant from text-irrelevant components in image embeddings. Based on this decomposition, we define Schur Complement Entropy (SCE) as a novel metric of intrinsic diversity. The same decomposition enables prompt-aware, controllable embedding editing, supporting prompt-level focusing or defocusing. Experiments across multiple T2I models show that SCE correlates strongly with human diversity assessments (Spearman's ρ > 0.92) and that edited embeddings substantially improve controllability and interpretability in downstream tasks.
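As a rough, simplified sketch of the defocusing idea (not the paper's implementation, which operates on kernel covariance matrices): with a plain linear covariance, removing the text-explained component amounts to subtracting from each image embedding its least-squares prediction from the paired text embedding. The function name and regularizer `eps` below are illustrative assumptions.

```python
import numpy as np

def defocus_embeddings(img_emb, txt_emb, eps=1e-6):
    """Remove the text-explained component from image embeddings.

    Linear (identity-kernel) simplification of the Schur complement
    idea: the residual of regressing image embeddings on their paired
    text embeddings is, by construction, uncorrelated with the text.
    """
    n, d = txt_emb.shape
    C_tt = txt_emb.T @ txt_emb / n   # text covariance (d x d)
    C_it = img_emb.T @ txt_emb / n   # image-text cross covariance
    # Ridge-regularized least-squares map from text to image space.
    W = C_it @ np.linalg.inv(C_tt + eps * np.eye(d))
    return img_emb - txt_emb @ W.T   # text-defocused residuals
```

The returned residuals carry only the part of each image embedding that the paired prompt does not predict, which is the "text-irrelevant" component the decomposition isolates.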
📝 Abstract
The use of CLIP embeddings to assess the alignment of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures the relevance of a generated image, it does not quantify the diversity of images generated by a text-to-image model. In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, i.e., their capacity to generate diverse images from similar text prompts. To achieve this, we propose a decomposition of the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. Using the Schur complement of the joint image-text kernel covariance matrix, we perform this decomposition and define the matrix-based entropy of the decomposed component as the *Schur Complement Entropy (SCE)* score, a measure of the intrinsic diversity of a text-to-image model based on data collected with varying text prompts. Additionally, we demonstrate the use of the Schur complement-based decomposition to nullify the influence of a given prompt in the CLIP embedding of an image, enabling embeddings to be focused on, or defocused from, specific objects or properties in downstream tasks. We present several numerical results that apply our Schur complement-based approach to evaluate text-to-image models and modify CLIP image embeddings. The codebase is available at https://github.com/aziksh-ospanov/CLIP-DISSECTION
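To make the pipeline concrete, here is a minimal sketch of an SCE-style computation, assuming a Gaussian kernel over paired CLIP embeddings; the kernel choice, bandwidth `sigma`, regularizer `eps`, and function names are all assumptions, not the authors' implementation. The image, text, and cross kernel matrices are formed, the Schur complement of the text block is taken, and the matrix-based entropy of its trace-normalized eigenvalues is returned.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel between rows of A and rows of B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def schur_complement_entropy(img_emb, txt_emb, sigma=1.0, eps=1e-6):
    n = img_emb.shape[0]
    K_ii = gaussian_kernel(img_emb, img_emb, sigma)  # image kernel
    K_tt = gaussian_kernel(txt_emb, txt_emb, sigma)  # text kernel
    K_it = gaussian_kernel(img_emb, txt_emb, sigma)  # cross kernel
    # Schur complement of the text block: the part of the image
    # kernel not explained by the text prompts.
    S = K_ii - K_it @ np.linalg.inv(K_tt + eps * np.eye(n)) @ K_it.T
    # Trace-normalize so the eigenvalues form a distribution, then
    # take the matrix-based (von Neumann) entropy; tiny negative
    # eigenvalues from numerical error are clipped.
    lam = np.linalg.eigvalsh(S / np.trace(S))
    lam = np.clip(lam, eps, None)
    return float(-np.sum(lam * np.log(lam)))
```

A higher score indicates that more of the variation among the generated images is unexplained by their prompts, i.e., higher intrinsic diversity; the score is bounded above by log n for n samples.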