🤖 AI Summary
To address the poor interpretability, dimensional redundancy, and weak exclusion-based query control of dense embeddings in cross-modal retrieval, this paper proposes a token-free, compact, sparse, and disentangled multimodal representation method. The approach integrates contrastive learning with structured sparsity constraints into joint image–text encoding, introducing an explicit factor-separation mechanism that jointly achieves fixed-dimensional representations, high sparsity, and semantic disentanglement. Evaluated on MSCOCO and Conceptual Captions, the method achieves up to 11% higher AP@10 than dense models such as CLIP, BLIP, and VISTA, and up to 21% higher than the sparse disentangled baseline VDR. Qualitative analysis demonstrates strong interpretability and fine-grained exclusion control, enabling precise suppression of irrelevant semantic factors during retrieval. This work supports interpretable, efficient, and controllable cross-modal representation learning without reliance on textual tokenization.
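To make the exclusion-control idea concrete, here is a minimal, purely illustrative sketch (not the paper's implementation): if each dimension of a sparse, disentangled embedding corresponds to a semantic factor, an exclusion query can penalize documents whose embeddings are active on the excluded factor's dimensions before ranking. All names, dimensions, and the penalty scheme below are hypothetical.

```python
def score(query, doc):
    """Dot-product relevance between two sparse embeddings."""
    return sum(q * d for q, d in zip(query, doc))

def exclusion_query(query, docs, excluded_dims, penalty=1.0):
    """Rank documents while suppressing an excluded semantic factor.

    `excluded_dims` are hypothetical indices of the factor(s) to exclude;
    activations there count *against* a document instead of for it.
    """
    results = []
    for name, doc in docs.items():
        s = score(query, doc)
        s -= penalty * sum(doc[i] for i in excluded_dims)
        results.append((s, name))
    return [name for s, name in sorted(results, reverse=True)]

# Toy 6-dim factor space: [dog, cat, beach, indoor, red, blue]
docs = {
    "dog_on_beach": [0.9, 0.0, 0.8, 0.0, 0.0, 0.0],
    "dog_indoors":  [0.9, 0.0, 0.0, 0.7, 0.0, 0.0],
    "cat_on_beach": [0.0, 0.9, 0.8, 0.0, 0.0, 0.0],
}
query = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # "dog"

# Exclude the "beach" factor (dimension 2): the indoor dog now ranks first.
print(exclusion_query(query, docs, excluded_dims=[2]))
# → ['dog_indoors', 'dog_on_beach', 'cat_on_beach']
```

This kind of direct dimension-level intervention is only meaningful because the representation is sparse and disentangled; in a dense entangled embedding such as CLIP's, no single dimension cleanly corresponds to "beach", so the excluded concept cannot be suppressed this way.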
📝 Abstract
Multimodal representations are essential for cross-modal retrieval, but they often lack interpretability, making it difficult to understand the reasoning behind retrieved results. Sparse disentangled representations offer a promising solution; however, existing methods rely heavily on text tokens, resulting in high-dimensional embeddings. In this work, we propose a novel approach that generates compact, fixed-size embeddings that maintain disentanglement while providing greater control over retrieval tasks. We evaluate our method on challenging exclusion queries using the MSCOCO and Conceptual Captions benchmarks, demonstrating notable improvements over dense models like CLIP, BLIP, and VISTA (with gains of up to 11% in AP@10), as well as over sparse disentangled models like VDR (achieving up to 21% gains in AP@10). Furthermore, we present qualitative results that emphasize the enhanced interpretability of our disentangled representations.