good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval

📅 2025-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained textual annotations for Composed Image Retrieval (CIR) are scarce, ambiguous, and lack diversity. Method: The paper proposes an end-to-end, vision-language-driven framework for synthetic annotation generation, built around a three-stage structured pipeline: (i) fine-grained object parsing, (ii) cross-image comparable description generation, and (iii) semantically aligned instruction synthesis. This design mitigates hallucination while ensuring object-level consistency and modification diversity, and enables high-quality, cross-domain CIR dataset construction from off-the-shelf vision-language models. Contribution/Results: Experiments show that the generated annotations improve retrieval accuracy across multiple CIR benchmarks and enhance model generalization. The dataset construction framework is open-sourced to support multimodal retrieval research.

📝 Abstract
Composed image retrieval (CIR) enables users to search images using a reference image combined with textual modifications. Recent advances in vision-language models have improved CIR, but dataset limitations remain a barrier. Existing datasets often rely on simplistic, ambiguous, or insufficient manual annotations, hindering fine-grained retrieval. We introduce good4cir, a structured pipeline leveraging vision-language models to generate high-quality synthetic annotations. Our method involves: (1) extracting fine-grained object descriptions from query images, (2) generating comparable descriptions for target images, and (3) synthesizing textual instructions capturing meaningful transformations between images. This reduces hallucination, enhances modification diversity, and ensures object-level consistency. Applying our method improves existing datasets and enables creating new datasets across diverse domains. Results demonstrate improved retrieval accuracy for CIR models trained on our pipeline-generated datasets. We release our dataset construction framework to support further research in CIR and multi-modal retrieval.
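The three-stage pipeline described in the abstract can be sketched as plain Python. This is an illustrative assumption, not the paper's implementation: the prompts, the `vlm` callable signature (image path plus prompt in, text out), and all function names here are hypothetical stand-ins for whatever vision-language model backend is used.

```python
from typing import Callable, List

# Hypothetical VLM interface: (image_path, prompt) -> model response text.
VLM = Callable[[str, str], str]

def describe_objects(vlm: VLM, query_image: str) -> List[str]:
    """Stage 1: extract fine-grained object descriptions from the query image."""
    response = vlm(query_image, "List each distinct object with fine-grained attributes.")
    return [line.strip() for line in response.splitlines() if line.strip()]

def describe_target(vlm: VLM, target_image: str, query_objects: List[str]) -> List[str]:
    """Stage 2: describe the target image conditioned on the query's object
    list, so the two descriptions are directly comparable object-by-object."""
    prompt = ("Describe this image, reusing the same object inventory where possible:\n"
              + "\n".join(query_objects))
    response = vlm(target_image, prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]

def synthesize_instruction(vlm: VLM, query_objects: List[str],
                           target_objects: List[str]) -> str:
    """Stage 3: turn the paired descriptions into a modification instruction."""
    prompt = ("Write one instruction transforming image A into image B.\n"
              "A: " + "; ".join(query_objects) + "\n"
              "B: " + "; ".join(target_objects))
    return vlm("", prompt)  # text-only step: no image attached

def annotate_pair(vlm: VLM, query_image: str, target_image: str) -> str:
    """Run all three stages on one (query, target) image pair."""
    query_objects = describe_objects(vlm, query_image)
    target_objects = describe_target(vlm, target_image, query_objects)
    return synthesize_instruction(vlm, query_objects, target_objects)
```

Keeping the VLM behind a plain callable makes the pipeline structure testable with a mock model, independent of any particular API.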
Problem

Research questions and friction points this paper is trying to address.

Scarcity of detailed, fine-grained captions for composed image retrieval
Dataset limitations caused by simplistic or ambiguous manual annotations
Hallucination and inconsistency in automatically generated modification text
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic captions using vision-language models
Extracts fine-grained object descriptions from images
Synthesizes textual instructions for image transformations