🤖 AI Summary
To address weak generalization and the reliance on manually annotated triplets in compositional image retrieval, this paper proposes a self-supervised contrastive pretraining framework. It leverages large language models (LLMs) to generate semantically aligned proxy supervision signals that replace ground-truth target image embeddings, enabling end-to-end training without triplet annotations. The method combines a contrastively pretrained vision-language model (e.g., CLIP), a text-guided embedding composition network, and generative prompting to construct cross-modal semantic editing representations. On the FashionIQ and CIRR benchmarks, it surpasses state-of-the-art zero-shot compositional retrieval methods as well as many fully supervised approaches, and generalizes to unseen objects and domains. Its core contribution is an LLM-based semantic proxy supervision mechanism that enables purely self-supervised zero-shot compositional retrieval without any human-annotated triplets.
📝 Abstract
Compositional image retrieval (CIR) is a multimodal learning task in which a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains, including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor-intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
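To make the proxy-supervision idea concrete, here is a minimal sketch of in-batch contrastive training where a composed (image + modification-text) query embedding is pulled toward a text embedding that stands in for the target image embedding. The `compose` function, embedding dimension, and temperature are hypothetical stand-ins for illustration, not the paper's actual architecture or hyperparameters:

```python
import math
import random

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def compose(img_emb, txt_emb, alpha=0.5):
    # Hypothetical stand-in for the learned composition network:
    # a simple convex combination of the query-image embedding and
    # the modification-text embedding. The real network is learned.
    return [alpha * i + (1 - alpha) * t for i, t in zip(img_emb, txt_emb)]

def info_nce(composed, proxy_targets, temperature=0.07):
    # In-batch contrastive (InfoNCE-style) loss: each composed query
    # should match its own proxy target (an LLM-generated caption's
    # text embedding, replacing the target image embedding) against
    # the other targets in the batch as negatives.
    losses = []
    for i, q in enumerate(composed):
        logits = [cosine(q, t) / temperature for t in proxy_targets]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax at index i
    return sum(losses) / len(losses)

# Toy batch of random embeddings standing in for CLIP outputs.
random.seed(0)
dim, batch = 8, 4
img_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
mod_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
proxy_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]

composed = [compose(i, t) for i, t in zip(img_embs, mod_embs)]
loss = info_nce(composed, proxy_embs)
```

In actual training, `loss` would be minimized over the parameters of the composition network while the vision-language encoders stay frozen; no target images, and hence no annotated triplets, are needed.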