CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Composed Image Retrieval (CIR) methods suffer from opaque cross-modal reasoning and poor responsiveness to fine-grained instructions. Method: The paper proposes the first end-to-end Chain-of-Thought (CoT) reasoning framework for CIR, introducing an explicit three-stage CoT process (description → reasoning → conclusion) into MLLM-based retrieval. It jointly optimizes vision-language alignment and the embedding space so the model generates interpretable reasoning paths. Contribution/Results: The authors present the first multimodal large language model designed specifically for CIR, release a new dataset with structured CoT annotations, achieve state-of-the-art performance on FashionIQ and CIRR, and demonstrate strong cross-domain generalization on the CIRCO benchmark. The approach improves both interpretability and retrieval accuracy, establishing a path toward transparent, instruction-aware multimodal retrieval.

📝 Abstract
Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as "black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.
Problem

Research questions and friction points this paper is trying to address.

Enhancing interpretability in composed image retrieval systems
Addressing black-box reasoning limitations in multimodal models
Generating structured reasoning chains for transparent decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end MLLM with Chain-of-Thought reasoning
Generates interpretable reasoning chain before retrieval
Fine-tuned using structured three-stage annotation process
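The pipeline these bullets describe — generate a structured three-stage CoT, then encode the retrieval intent and rank candidates by embedding similarity — can be sketched in plain Python. Everything below is illustrative: the tag names, the toy embeddings, and the `build_cot_prompt`/`retrieve` helpers are assumptions for exposition, not the paper's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_cot_prompt(reference_caption, modification_text):
    """Assemble the three-stage structured output the summary describes:
    description -> reasoning -> conclusion (tag names are hypothetical)."""
    return (
        f"<description> {reference_caption} </description>\n"
        f"<reasoning> Apply the requested edit: {modification_text} </reasoning>\n"
        "<conclusion> description of the target image </conclusion>"
    )

def retrieve(query_embedding, gallery):
    """Rank gallery items (id, embedding) by similarity to the
    query's retrieval-intent embedding, most similar first."""
    return sorted(
        gallery,
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )

# Toy example: the "intent" embedding points mostly along axis 0,
# so the gallery image whose embedding lies on that axis ranks first.
gallery = [("img_a", [1.0, 0.0]), ("img_b", [0.0, 1.0])]
ranking = retrieve([0.9, 0.1], gallery)
print(build_cot_prompt("a red dress", "make it sleeveless"))
print([name for name, _ in ranking])  # -> ['img_a', 'img_b']
```

In the actual system, the structured CoT text would be produced by the fine-tuned MLLM and the embeddings by its dedicated retrieval head; the ranking step itself reduces to a nearest-neighbor search like the one above.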
Weihuang Lin
Xiamen University
Yiwei Ma
Stevens Institute of Technology
Jiayi Ji
Rutgers University
Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China