CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval

📅 2026-01-07
🏛️ arXiv.org
🤖 AI Summary
This work addresses representation-space fragmentation in existing composed image retrieval (CIR) methods, where queries and targets consist of heterogeneous modalities and are processed by separate encoders, leaving their representations misaligned. To overcome this, the authors propose a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides multimodal large language models to generate discriminative, semantically compatible captions for target images, coupled with a symmetric dual-tower architecture in which the query and target sides share a single Q-Former for cross-modal encoding. An entropy-based, temporally dynamic memory bank additionally supplies high-quality negative samples during training while staying consistent with the evolving model state. The method achieves state-of-the-art performance on four benchmark datasets with superior training efficiency, and ablation studies confirm the effectiveness of each component.
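
The shared-encoder idea in the summary can be illustrated with a toy sketch. This is not the authors' implementation: a random linear projection stands in for the shared Q-Former, plain feature vectors stand in for the fused image and text inputs, and the loss is a standard in-batch InfoNCE objective chosen here as an assumption. The names `make_shared_encoder` and `info_nce` and the temperature of 0.07 are hypothetical.

```python
import math
import random


def make_shared_encoder(dim_in, dim_out, seed=0):
    """Build ONE projection that both towers call.

    Assumption: a random linear map is only a stand-in for the
    shared-parameter Q-Former described in the summary.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim_in)]
               for _ in range(dim_out)]

    def encode(features):
        projected = [sum(w * x for w, x in zip(row, features))
                     for row in weights]
        norm = math.sqrt(sum(v * v for v in projected)) or 1.0
        return [v / norm for v in projected]  # L2-normalised embedding

    return encode


def info_nce(queries, targets, temperature=0.07):
    """Mean cross-entropy where queries[i] should match targets[i].

    All other targets in the batch act as negatives.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [sum(a * b for a, b in zip(query, target)) / temperature
                  for target in targets]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(queries)


# Toy batch: each query is a slightly perturbed copy of its target.
rng = random.Random(1)
raw_targets = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
raw_queries = [[x + 0.01 * rng.gauss(0.0, 1.0) for x in t]
               for t in raw_targets]

encode = make_shared_encoder(dim_in=8, dim_out=4)  # same weights, both sides
query_emb = [encode(q) for q in raw_queries]
target_emb = [encode(t) for t in raw_targets]
loss = info_nce(query_emb, target_emb)
```

Because both sides pass through the same `encode` function, query and target embeddings live in one space from the first step, which is the property the symmetric design is meant to provide. In this toy setup the matched pairing should typically give a lower loss than a shuffled one.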

📝 Abstract
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
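
The abstract does not spell out how the entropy-based, temporally dynamic Memory Bank works, so the following is only one possible reading, sketched under stated assumptions. Here the bank is a FIFO queue whose entries expire after `max_age` training steps (the "temporal" part), and the entropy of the query's softmax distribution over bank entries decides how many hard negatives to draw (the "entropy" part). The class `MemoryBank`, the function `select_negatives`, and every parameter value are hypothetical and are not taken from the paper.

```python
import math
from collections import deque


class MemoryBank:
    """FIFO bank of (embedding, step) pairs; old entries expire by age."""

    def __init__(self, capacity, max_age):
        self.entries = deque(maxlen=capacity)  # oldest entry is evicted first
        self.max_age = max_age

    def add(self, embedding, step):
        self.entries.append((embedding, step))

    def fresh(self, step):
        """Embeddings written within the last `max_age` steps."""
        return [emb for emb, written in self.entries
                if step - written <= self.max_age]


def softmax_entropy(scores, temperature=0.07):
    """Shannon entropy (in nats) of softmax(scores / temperature)."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_negatives(query, bank, step, max_k=4):
    """Pick the hardest fresh negatives for one query.

    Assumed policy (not from the paper): the flatter the similarity
    distribution, the more negatives are drawn, up to `max_k`.
    """
    candidates = bank.fresh(step)
    if len(candidates) <= 1:
        return candidates
    scores = [sum(a * b for a, b in zip(query, c)) for c in candidates]
    normalized = softmax_entropy(scores) / math.log(len(candidates))
    k = max(1, min(len(candidates), max_k, round(normalized * max_k)))
    ranked = sorted(zip(scores, candidates),
                    key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:k]]


# Usage: fill the bank during training, then query it at a later step.
bank = MemoryBank(capacity=8, max_age=10)
bank.add([1.0, 0.0], step=0)
bank.add([0.0, 1.0], step=1)
bank.add([-1.0, 0.0], step=2)
negatives = select_negatives([1.0, 0.0], bank, step=3)
```

Expiring entries by age is one simple way to keep stored embeddings close to the current model state. Because a symmetric design encodes queries and targets with the same network, stored target embeddings remain directly comparable with new query embeddings.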

Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval
representation space fragmentation
heterogeneous modalities
feature alignment
modality asymmetry

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Prompting
Symmetric Dual-Tower Architecture
Memory Bank
Composed Image Retrieval
Multimodal Alignment

Authors

Zhipeng Qian
Kuaishou Technology

Zihan Liang
Kuaishou Technology

Yufei Ma
Peking University
Neural Network Accelerator, Computing-in-Memory, FPGA Design, Neuromorphic Computing

Ben Chen
KuaiShou, Alibaba, HUST, WHU
Multimodal, LLM, Generative Recommendation, Semantic Matching

Huangyu Dai
Kuaishou Technology

Yiwei Ma
Stevens Institute of Technology

Jiayi Ji
Rutgers University

Chenyi Lei
Kuaishou Technology
Recommender System, Information Retrieval, Generative Recommendation, Multimodal

Han Li
Kuaishou Technology

Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China