🤖 AI Summary
To address key bottlenecks in Zero-Shot Composed Image Retrieval (ZS-CIR), including multimodal query incompatibility, loss of visual detail, and insufficient reasoning, this paper proposes CoTMR, a training-free framework built on Chain-of-Thought (CoT)-guided multi-scale reasoning. CoTMR requires no fine-tuning or training data: it employs a Large Vision-Language Model (LVLM) to jointly understand the reference image and the textual modification, and its CoT component, CIRCoT, decomposes retrieval into predefined subtasks for reliable step-by-step inference. On top of global semantic matching, a Multi-Grained Scoring (MGS) mechanism adds object-level predictions about the presence or absence of key elements, and fuses the CLIP similarity scores of these reasoning outputs with candidate images. CoTMR achieves state-of-the-art performance across four major benchmarks while offering strong interpretability and zero-training deployability.
📝 Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples. Existing methods primarily combine caption models and large language models (LLMs) to generate target captions based on composed queries but face various issues such as incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-Thought (CoT) and Multi-scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR employs the Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning for composed queries. To enhance reasoning reliability, we devise CIRCoT, which guides the LVLM through a step-by-step inference process using predefined subtasks. Considering that existing approaches focus solely on global-level reasoning, our CoTMR incorporates multi-scale reasoning to achieve more comprehensive inference via fine-grained predictions about the presence or absence of key elements at the object scale. Further, we design a Multi-Grained Scoring (MGS) mechanism, which integrates CLIP similarity scores of the above reasoning outputs with candidate images to realize precise retrieval. Extensive experiments demonstrate that our CoTMR not only drastically outperforms previous methods across four prominent benchmarks but also offers appealing interpretability.
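The Multi-Grained Scoring idea described above can be sketched in a minimal, hypothetical form: fuse the CLIP similarity of a global-level reasoned caption with averaged object-level element similarities against candidate image embeddings. All embeddings and the fusion weight `alpha` below are illustrative placeholders, not the paper's actual formulation.

```python
import numpy as np

def cosine_sim(query_emb, candidate_embs):
    """Cosine similarity between one query vector and a matrix of candidates."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return c @ q

def multi_grained_score(global_text_emb, object_text_embs, image_embs, alpha=0.7):
    """Hypothetical MGS-style fusion: weight the global-caption similarity
    against the mean of object-level (presence/absence) similarities.
    alpha is an assumed fusion weight, not a value from the paper."""
    global_score = cosine_sim(global_text_emb, image_embs)
    object_score = np.mean(
        [cosine_sim(e, image_embs) for e in object_text_embs], axis=0
    )
    return alpha * global_score + (1 - alpha) * object_score

# Toy usage with random stand-ins for CLIP embeddings (3 candidates, dim 4).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(3, 4))
global_emb = rng.normal(size=4)
object_embs = [rng.normal(size=4), rng.normal(size=4)]
scores = multi_grained_score(global_emb, object_embs, image_embs)
ranking = np.argsort(-scores)  # candidate indices, best match first
```

In practice the text and image embeddings would come from a CLIP encoder applied to the LVLM's reasoning outputs and the candidate gallery; the fusion weight would be a design choice of the scoring mechanism.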