🤖 AI Summary
To address weak generalization and the reliance on manually annotated triplets in compositional image retrieval, this paper proposes a self-supervised contrastive pretraining framework. It leverages large language models (LLMs) to generate semantically aligned proxy supervision signals that replace ground-truth target image embeddings, enabling end-to-end training without triplet annotations. The method combines a contrastively pretrained vision-language model (e.g., CLIP), a text-guided embedding composition network, and generative prompting to construct cross-modal semantic editing representations. On the FashionIQ and CIRR benchmarks, it surpasses state-of-the-art zero-shot compositional retrieval methods as well as many fully supervised approaches, and generalizes to unseen objects and domains. Its core contribution is an LLM-based semantic proxy supervision mechanism that enables purely self-supervised zero-shot compositional retrieval without any human-annotated triplets.
📝 Abstract
Compositional image retrieval (CIR) is a multimodal learning task in which a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains, including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor-intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
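To make the proxy-supervision idea concrete, here is a minimal sketch of in-batch contrastive training where a composed (image + modification-text) query embedding is pulled toward a text embedding that stands in for the target image embedding. The `compose` function, embedding dimension, and temperature are hypothetical stand-ins for illustration, not the paper's actual architecture or hyperparameters:

```python
import math
import random

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def compose(img_emb, txt_emb, alpha=0.5):
    # Hypothetical stand-in for the learned composition network:
    # a simple convex combination of the query-image embedding and
    # the modification-text embedding. The real network is learned.
    return [alpha * i + (1 - alpha) * t for i, t in zip(img_emb, txt_emb)]

def info_nce(composed, proxy_targets, temperature=0.07):
    # In-batch contrastive (InfoNCE-style) loss: each composed query
    # should match its own proxy target (an LLM-generated caption's
    # text embedding, replacing the target image embedding) against
    # the other targets in the batch as negatives.
    losses = []
    for i, q in enumerate(composed):
        logits = [cosine(q, t) / temperature for t in proxy_targets]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax at index i
    return sum(losses) / len(losses)

# Toy batch of random embeddings standing in for CLIP outputs.
random.seed(0)
dim, batch = 8, 4
img_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
mod_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
proxy_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]

composed = [compose(i, t) for i, t in zip(img_embs, mod_embs)]
loss = info_nce(composed, proxy_embs)
```

In actual training, `loss` would be minimized over the parameters of the composition network while the vision-language encoders stay frozen; no target images, and hence no annotated triplets, are needed.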