🤖 AI Summary
This work addresses the semantic entanglement and limited fine-grained controllability inherent in vision-language joint embedding spaces (e.g., CLIP). To this end, we propose Sparse Linear Concept Subspaces (SLiCS), a disentanglement method that decomposes joint embeddings into multiple concept-specific, semantically orthogonal, sparse component vectors. SLiCS employs a multi-label supervised alternating optimization algorithm to construct a group-structured dictionary, integrating text-embedding-driven grouping of concept atoms and cross-modal alignment to ensure the decomposition is semantically consistent and interpretable. Extensive evaluation on CLIP, TiTok, and DINOv2 embeddings demonstrates that SLiCS significantly improves concept-filtered image retrieval accuracy, enables zero-shot conditional generation, and supports high-fidelity semantic editing. Theoretically, the optimization algorithm is guaranteed to converge; empirically, SLiCS generalizes well across diverse vision-language models.
📝 Abstract
Vision-language co-embedding networks, such as CLIP, provide a latent embedding space whose semantic information is useful for downstream tasks. We hypothesize that this embedding space can be disentangled to separate information about the content of complex scenes by decomposing an embedding into multiple concept-specific component vectors that lie in different subspaces. We propose a supervised dictionary learning approach that estimates a linear synthesis model consisting of sparse, non-negative combinations of groups of dictionary vectors (atoms), whose group-wise activity matches the multi-label information. Each concept-specific component is a non-negative combination of the atoms associated with one label. The group-structured dictionary is optimized through a novel alternating optimization with guaranteed convergence. Exploiting the text co-embeddings, we detail how semantically meaningful descriptions of each concept can be found from the text embeddings of words best approximated by that concept's group of atoms, and how dictionary learning can proceed without manual annotation by zero-shot classifying training-set images against the text embeddings of concept labels to provide instance-wise multi-labels. We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable more precise concept-filtered image retrieval (and conditional generation using image-to-prompt). We also apply SLiCS to highly compressed autoencoder embeddings from TiTok and to the latent embedding from self-supervised DINOv2. Quantitative and qualitative results highlight the improved precision of concept-filtered image retrieval for all embeddings.
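The synthesis model described in the abstract, where an embedding is approximated as a sparse, non-negative combination of atoms from label-associated groups, can be sketched as follows. The random dictionary, the one-group-per-label assignment, and the use of plain non-negative least squares for the coding step are illustrative assumptions for this sketch, not the paper's actual learned dictionary or alternating optimization.

```python
import numpy as np
from scipy.optimize import nnls

# Illustrative group-structured dictionary: K atoms in R^d, each atom
# assigned to exactly one concept label (an assumption of this sketch).
rng = np.random.default_rng(0)
d, atoms_per_group, labels = 16, 4, ["dog", "car", "tree"]
D = rng.normal(size=(d, atoms_per_group * len(labels)))
D /= np.linalg.norm(D, axis=0)                      # unit-norm atoms
group = {lab: np.arange(i * atoms_per_group, (i + 1) * atoms_per_group)
         for i, lab in enumerate(labels)}

def decompose(z, active_labels):
    """Approximate embedding z with non-negative coefficients restricted
    to the atom groups of the labels present in the image; return one
    concept-specific component vector per active label."""
    idx = np.concatenate([group[lab] for lab in active_labels])
    coef, _ = nnls(D[:, idx], z)                    # non-negative coding
    components = {}
    for lab in active_labels:
        pos = [int(np.where(idx == j)[0][0]) for j in group[lab]]
        components[lab] = D[:, group[lab]] @ coef[pos]   # concept component
    return components

# A synthetic "embedding" built from dog + car atoms:
z = (D[:, group["dog"]] @ np.array([1.0, 0.5, 0.0, 0.0])
     + D[:, group["car"]] @ np.array([0.0, 0.8, 0.0, 0.0]))
parts = decompose(z, ["dog", "car"])
residual = z - sum(parts.values())
```

Because the group-wise activity is tied to the image's multi-labels, the sum of the returned components reconstructs the embedding, while each component isolates one concept's contribution, which is what enables concept-filtered retrieval.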