AI Summary
This work addresses a key challenge in composed image retrieval: reference images often contain irrelevant visual noise that hinders accurate capture of the user's intent. To this end, we propose a chain-of-thought reasoning framework grounded in multimodal large language models. Our approach generates "keep-remove-infer" triplet textual instructions that guide a two-level visual attention mechanism to adaptively select discriminative semantics at both patch-level and instance-level granularities, followed by a weighted fusion of the multi-granularity visual and textual cues. By introducing chain-of-thought reasoning into the multi-level visual selection process for the first time, our method significantly sharpens the focus on intent-relevant semantics. It achieves state-of-the-art performance on both the CIRR and FashionIQ benchmarks, substantially outperforming existing approaches.
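The "keep-remove-infer" triplet described above can be pictured as a small structured output that an MLLM is prompted to produce. The following sketch is purely illustrative: the field names, prompt wording, and helper function are assumptions for exposition, not the paper's actual prompt or API.

```python
# Hypothetical sketch of the "keep-remove-infer" triplet instruction format.
# Field names and prompt wording are illustrative assumptions, not the
# paper's actual prompt design.
from dataclasses import dataclass


@dataclass
class TripletInstruction:
    keep: str    # reference-image semantics to retain
    remove: str  # semantics to suppress as irrelevant visual noise
    infer: str   # imagined textual description of the target image


def build_cot_prompt(modification_text: str) -> str:
    """Assemble a chain-of-thought prompt asking an MLLM for the triplet."""
    return (
        "Given the reference image and the modification "
        f"'{modification_text}', reason step by step and answer:\n"
        "1. KEEP: which visual semantics should be retained?\n"
        "2. REMOVE: which visual semantics are irrelevant noise?\n"
        "3. INFER: describe the likely target image.\n"
    )


prompt = build_cot_prompt("make the dress sleeveless and red")
```

In a full pipeline, the MLLM's answer would be parsed into a `TripletInstruction`, and the three texts would then condition the patch-level and instance-level attention selection.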
Abstract
Composed Image Retrieval (CIR) aims to retrieve target images given a reference image and a modification text. However, existing methods often struggle to extract from the reference image the semantic cues that best reflect the user's intent under the textual modification, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) framework for CIR, which integrates attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating retained, removed, and target-inferred texts. These textual cues then guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modification text and the imagined target description, we design a weighted hierarchical combination module that aligns the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.
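The weighted hierarchical combination step can be sketched as a convex mixture of the four cue embeddings (patch-level visual, instance-level visual, modification text, imagined target description), with retrieval by cosine similarity in the shared embedding space. This is a minimal numpy sketch under assumed shapes and a simple softmax weighting; the paper's actual module, weight parameterization, and feature dimensions may differ.

```python
import numpy as np


def l2norm(x, axis=-1):
    """Project features onto the unit sphere for cosine-similarity retrieval."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def weighted_fusion(patch_feat, inst_feat, text_feat, target_feat, weights):
    """Combine the four cue embeddings with softmax-normalized weights.

    A softmax over the raw weights makes the fusion a convex mixture, so no
    single cue can dominate unboundedly (an illustrative design choice).
    """
    w = np.exp(weights) / np.exp(weights).sum()
    query = (w[0] * patch_feat + w[1] * inst_feat
             + w[2] * text_feat + w[3] * target_feat)
    return l2norm(query)


rng = np.random.default_rng(0)
d = 16  # assumed embedding dimension for the sketch
cues = [l2norm(rng.standard_normal(d)) for _ in range(4)]

# Zero raw weights -> uniform 1/4 mixture of the four cues.
query = weighted_fusion(*cues, weights=np.zeros(4))

gallery = l2norm(rng.standard_normal((5, d)))  # 5 candidate target images
scores = gallery @ query                       # cosine similarities
ranking = np.argsort(-scores)                  # retrieval order, best first
```

In the actual model the weights would be learned (and the cue encoders trained) so that the composed query lands near the true target image's embedding.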