Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of supervised composed image retrieval (CIR) — namely, reliance on auxiliary ranking models or intricate prompt engineering, and difficulty in directly optimizing pretrained vision-language models — this paper proposes a training-free method that enhances representations at inference time. The approach tackles these challenges through two key innovations: (1) a pyramid matching model that enables fine-grained vision-language alignment via multi-granularity patch-level correspondence; and (2) a chain-of-thought (CoT) representation injection mechanism that implicitly encodes structured reasoning capabilities into the model's embeddings, without explicit textual reasoning or architectural modification. The method requires neither fine-tuning nor external rankers. Extensive experiments demonstrate substantial improvements in retrieval accuracy, achieving new state-of-the-art performance on standard benchmarks including FashionIQ and CIRR. Code and models are publicly available.
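The first innovation, multi-granularity patch-level correspondence, can be illustrated with a minimal sketch: an image is divided into patch grids at several granularities, so coarse levels capture global context and fine levels capture local detail. This is a hypothetical illustration of the general idea, not the paper's actual Pyramid Patcher code; the function name and level choices are assumptions.

```python
import numpy as np

def pyramid_patches(image, levels=(1, 2, 4)):
    """Split an image into patch grids at multiple granularities.

    At level g the image is divided into a g x g grid and each cell
    becomes one patch. Illustrative sketch only; not the paper's
    Pyramid Patcher implementation.
    """
    h, w = image.shape[:2]
    patches = []
    for g in levels:
        ph, pw = h // g, w // g
        for i in range(g):
            for j in range(g):
                # Slice rows and columns for cell (i, j); channels kept as-is.
                patches.append(image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw])
    return patches

# A 1x1 + 2x2 + 4x4 pyramid yields 1 + 4 + 16 = 21 patches.
patches = pyramid_patches(np.zeros((224, 224, 3)), levels=(1, 2, 4))
```

Each patch would then be encoded and matched against the text at its own granularity, which is what "patch-level correspondence" refers to in the summary above.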

📝 Abstract
Composed Image Retrieval (CIR) presents a significant challenge, as it requires jointly understanding a reference image and a modifying textual instruction to find relevant target images. Some existing methods adopt a two-stage approach to further refine retrieval results, but this typically requires training an additional ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application to CIR remains limited: they compress visual information into text or rely on elaborate prompt designs. Moreover, existing works apply CoT only to zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework, the Pyramid Matching Model with Training-Free Refinement (PMTFR), to address these challenges. Through a simple but effective module called the Pyramid Patcher, we enhance the Pyramid Matching Model's understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into LVLMs. This allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
Problem

Research questions and friction points this paper is trying to address.

Improves supervised composed image retrieval via reasoning-augmented representations
Enhances visual understanding without additional ranking model training
Achieves better performance in supervised CIR without explicit textual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pyramid Patcher enhances multi-granularity visual understanding
Representation engineering injects CoT-derived representations into LVLMs
Training-Free Refinement improves retrieval scores without retraining
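The representation-injection idea from the bullets above can be sketched generically: extract a direction in hidden-state space that separates CoT-style activations from plain ones, then shift the model's hidden states along that direction at inference, with no weight updates and no textual reasoning. This is a toy difference-of-means sketch of representation engineering in general, not the paper's method; all names and the toy data are assumptions.

```python
import numpy as np

def reasoning_direction(cot_states, plain_states):
    """Difference-of-means direction between hidden states collected
    from CoT-style prompts and from plain prompts.

    Both inputs are (n_samples, hidden_dim) arrays. Illustrative only.
    """
    return cot_states.mean(axis=0) - plain_states.mean(axis=0)

def inject(hidden, direction, alpha=1.0):
    """Shift a hidden state along the reasoning direction at inference,
    steering behavior without retraining or explicit textual reasoning."""
    return hidden + alpha * direction

# Toy activations standing in for LVLM hidden states.
rng = np.random.default_rng(0)
cot = rng.normal(1.0, 0.1, size=(8, 16))
plain = rng.normal(0.0, 0.1, size=(8, 16))

d = reasoning_direction(cot, plain)
h = inject(np.zeros(16), d, alpha=1.0)
```

In the training-free refinement setting described above, such a shifted state would be used to rescore candidate images, leaving the pretrained model's weights untouched.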
👥 Authors
Jun Li (Kuaishou Technology)
Kai Li (Kuaishou Technology)
Shaoguo Liu (Alibaba Corporation)
Tingting Gao (Kuaishou Technology)

🏷️ Machine Learning · Computer Vision