ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

📅 2025-05-27
🤖 AI Summary
Existing composed image retrieval (CIR) methods struggle to model fine-grained semantic alignment between the text modification and the query image, limiting retrieval performance. This paper proposes a noun-phrase-level alignment framework: (1) a text concept-consistency loss explicitly encourages the representations of noun phrases to attend to the corresponding local regions of the query image; (2) a controllable synthetic data generation pipeline supports both zero-shot and supervised training. To the authors' knowledge, this is the first CIR approach grounded in noun-level semantic alignment. Built on the CLIP architecture, the method combines multimodal representation learning, contrastive learning, and interpretable attention. It sets a new state of the art, in both the supervised and zero-shot settings, on standard benchmarks including CIRR and CIRCO, with substantial gains in retrieval accuracy. Code, pretrained models, and the newly synthesized dataset are publicly released.

📝 Abstract
Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at https://github.com/mvrl/ConText-CIR.
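The Text Concept-Consistency loss described above rewards noun-phrase representations whose attention falls on the relevant parts of the query image. A minimal sketch of one way such a loss could look: assuming each noun phrase comes with a softmax attention distribution over image patches and a binary mask of the patches it should cover, penalize the negative log of the attention mass inside the mask. The function name, inputs, and exact formulation here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def concept_consistency_loss(attn, region_masks, eps=1e-8):
    """Hypothetical sketch of a text concept-consistency loss.

    attn:         (P, N) attention of P noun-phrase tokens over N image
                  patches; each row sums to 1 (softmax output).
    region_masks: (P, N) binary masks marking the patches each noun
                  phrase should attend to.

    Returns the mean negative log attention mass inside each mask, so the
    loss is ~0 when attention is fully on the relevant patches and grows
    as attention leaks onto irrelevant ones.
    """
    inside = (attn * region_masks).sum(axis=1)   # attention mass on relevant patches
    return float(-np.log(inside + eps).mean())
```

With attention concentrated exactly on the masked patches the loss is near zero, while a uniform attention map over two patches yields about 0.69 (i.e. -log 0.5), so gradient descent on this term pushes noun-phrase attention toward its regions.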
Problem

Research questions and friction points this paper is trying to address.

Improving composed image retrieval accuracy with text modifications
Enhancing noun phrase representation in text-image attention
Generating synthetic training data for better CIR performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text Concept-Consistency loss for better noun phrase representation
Synthetic data generation pipeline for training
State-of-the-art supervised and zero-shot results on CIRR and CIRCO
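At inference time, CIR systems of this kind fuse the query image embedding with the modification-text embedding and rank gallery images by similarity. The sketch below uses simple additive fusion and cosine similarity as stand-ins; ConText-CIR's actual fusion is learned, so treat this only as an illustration of the retrieval step.

```python
import numpy as np

def retrieve(query_img_emb, text_emb, gallery, k=5):
    """Rank gallery embeddings for a composed (image + text) query.

    Additive fusion of the two query embeddings, then cosine similarity
    against every gallery embedding; returns the indices of the top-k
    candidates. The fusion rule is a placeholder, not the paper's model.
    """
    q = query_img_emb + text_emb
    q = q / np.linalg.norm(q)                                    # unit-normalize query
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)  # unit-normalize gallery
    scores = g @ q                                                # cosine similarities
    return np.argsort(-scores)[:k]
```

A gallery image whose embedding points in the same direction as the fused query ranks first, which is the behavior the training losses above are shaping the embedding space toward.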