SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

📅 2025-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak generalization and reliance on manually annotated triplets in zero-shot compositional image retrieval, this paper proposes a self-supervised contrastive pretraining framework. It leverages large language models (LLMs) to generate semantically aligned proxy supervision signals that replace ground-truth target image embeddings, enabling end-to-end training without triplet annotations. The method combines a vision-language model (e.g., CLIP), a text-guided embedding composition network, and generative prompting to construct cross-modal semantic editing representations. On the FashionIQ and CIRR benchmarks, it outperforms existing zero-shot methods and surpasses many fully supervised approaches, excelling particularly in cross-object and cross-domain retrieval. Its core innovation is an LLM-based semantic proxy supervision mechanism that enables purely self-supervised zero-shot compositional retrieval without any human-annotated triplets.
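The LLM-driven proxy-supervision step summarized above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual prompts or code: the prompt wording, the `MODIFICATION:`/`TARGET:` response format, and the helper names are all assumptions. The idea is that, from a source caption in an existing image-text dataset, an LLM invents a modification instruction and the resulting target caption, whose text embedding later serves as the proxy target.

```python
# Hypothetical sketch of generating a proxy-supervised triplet with an LLM.
# The prompt format and parsing convention below are illustrative assumptions,
# not the paper's actual implementation.

def build_triplet_prompt(source_caption: str) -> str:
    """Construct a prompt asking an LLM for a (modification, target caption) pair."""
    return (
        "You are given a caption describing an image.\n"
        f"Caption: {source_caption}\n"
        "1. Write a short instruction that modifies the described image.\n"
        "2. Write the caption of the modified image.\n"
        "Answer as: MODIFICATION: ... | TARGET: ..."
    )

def parse_triplet_response(response: str) -> tuple[str, str]:
    """Parse an LLM response of the assumed format into (modification, target caption)."""
    mod_part, target_part = response.split("|")
    modification = mod_part.split("MODIFICATION:")[1].strip()
    target = target_part.split("TARGET:")[1].strip()
    return modification, target
```

The (source caption, modification, target caption) triple then plays the role of an annotated triplet, with the target caption's text embedding standing in for a target image.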

📝 Abstract
Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
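The proxy-target contrastive objective described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed names and shapes: a simple additive fusion stands in for the learned composition network, and the `proxy_targets` rows represent CLIP text embeddings of LLM-generated target captions replacing target image embeddings.

```python
# Minimal sketch (not the paper's code) of contrastive training with proxy
# text-embedding targets: composed[i] should score highest against its own
# proxy target within the batch (InfoNCE with in-batch negatives).
import numpy as np

def info_nce_loss(composed: np.ndarray, proxy_targets: np.ndarray,
                  temperature: float = 0.07) -> float:
    """InfoNCE over a batch: row i of `composed` is positive with row i of `proxy_targets`."""
    # L2-normalize so dot products are cosine similarities.
    c = composed / np.linalg.norm(composed, axis=1, keepdims=True)
    t = proxy_targets / np.linalg.norm(proxy_targets, axis=1, keepdims=True)
    logits = c @ t.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal (matched pairs within the batch).
    return float(-np.mean(np.diag(log_probs)))

def compose(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Toy additive fusion standing in for the learned composition network."""
    return image_emb + text_emb
```

In the actual method, the composition network's parameters would be trained to minimize this loss while the vision-language encoders stay frozen; the additive `compose` above is only a placeholder for that network.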
Problem

Research questions and friction points this paper is trying to address.

Image Retrieval
Unseen Data Performance
Manual Annotation Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised Learning
Large-scale Language Models
Multimodal Representation
Bhavin Jawade
University at Buffalo, Netflix Research, Ex Yahoo Research, Ex Adobe Research
Vision Language, Computer Vision, Machine Learning, Cross Modal
Joao V. B. Soares
Yahoo Research
K. Thadani
Yahoo Research
Deen Dayal Mohan
Yahoo Research
Amir Erfan Eshratifar
Yahoo Research
Paloma de Juan
Yahoo Research
Srirangaraj Setlur
Principal Research Scientist, State University of New York at Buffalo
Pattern Recognition, Document Analysis, Handwriting Recognition, Biometrics
Venugopal Govindaraju
University at Buffalo, SUNY