🤖 AI Summary
To address the poor interpretability, dimensional redundancy, and weak exclusion-based query control of dense embeddings in cross-modal retrieval, this paper proposes a token-free, compact, sparse, and disentangled multimodal representation method. The approach integrates contrastive learning with structured sparsity constraints into joint image–text encoding, introducing an explicit factor-separation mechanism that jointly achieves fixed-dimensional representations, high sparsity, and semantic disentanglement. Evaluated on MSCOCO and Conceptual Captions, the method achieves up to 11% higher AP@10 than dense models such as CLIP, BLIP, and VISTA, and up to 21% higher than the sparse disentangled baseline VDR. Qualitative analysis demonstrates strong interpretability and fine-grained exclusion control, enabling precise suppression of irrelevant semantic factors during retrieval. This work supports interpretable, efficient, and controllable cross-modal representation learning without reliance on textual tokenization.
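To make the exclusion-control idea concrete, here is a minimal, purely illustrative sketch (not the paper's implementation): if each dimension of a sparse, disentangled embedding corresponds to a semantic factor, an exclusion query can penalize documents whose embeddings are active on the excluded factor's dimensions before ranking. All names, dimensions, and the penalty scheme below are hypothetical.

```python
def score(query, doc):
    """Dot-product relevance between two sparse embeddings."""
    return sum(q * d for q, d in zip(query, doc))

def exclusion_query(query, docs, excluded_dims, penalty=1.0):
    """Rank documents while suppressing an excluded semantic factor.

    `excluded_dims` are hypothetical indices of the factor(s) to exclude;
    activations there count *against* a document instead of for it.
    """
    results = []
    for name, doc in docs.items():
        s = score(query, doc)
        s -= penalty * sum(doc[i] for i in excluded_dims)
        results.append((s, name))
    return [name for s, name in sorted(results, reverse=True)]

# Toy 6-dim factor space: [dog, cat, beach, indoor, red, blue]
docs = {
    "dog_on_beach": [0.9, 0.0, 0.8, 0.0, 0.0, 0.0],
    "dog_indoors":  [0.9, 0.0, 0.0, 0.7, 0.0, 0.0],
    "cat_on_beach": [0.0, 0.9, 0.8, 0.0, 0.0, 0.0],
}
query = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # "dog"

# Exclude the "beach" factor (dimension 2): the indoor dog now ranks first.
print(exclusion_query(query, docs, excluded_dims=[2]))
# → ['dog_indoors', 'dog_on_beach', 'cat_on_beach']
```

This kind of direct dimension-level intervention is only meaningful because the representation is sparse and disentangled; in a dense entangled embedding such as CLIP's, no single dimension cleanly corresponds to "beach", so the excluded concept cannot be suppressed this way.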
📝 Abstract
Multimodal representations are essential for cross-modal retrieval, but they often lack interpretability, making it difficult to understand the reasoning behind retrieved results. Sparse disentangled representations offer a promising solution; however, existing methods rely heavily on text tokens, resulting in high-dimensional embeddings. In this work, we propose a novel approach that generates compact, fixed-size embeddings that maintain disentanglement while providing greater control over retrieval tasks. We evaluate our method on challenging exclusion queries using the MSCOCO and Conceptual Captions benchmarks, demonstrating notable improvements over dense models like CLIP, BLIP, and VISTA (with gains of up to 11% in AP@10), as well as over sparse disentangled models like VDR (achieving up to 21% gains in AP@10). Furthermore, we present qualitative results that emphasize the enhanced interpretability of our disentangled representations.