🤖 AI Summary
This work addresses the semantic bias introduced by vague textual descriptions generated by multimodal large language models in zero-shot composed image retrieval, which often degrades retrieval accuracy. To mitigate this issue, the authors propose a training-free semantic debiasing reranking framework that uses a selective chain-of-thought prompting strategy to guide the model toward the visual content relevant to the modification text. The framework employs a two-stage mechanism: an anchoring stage that supplements missing semantic cues and a debiasing stage that explicitly corrects description bias by suppressing redundant information through a penalty term on the similarity score. This is the first explicit semantic debiasing mechanism tailored to zero-shot composed image retrieval, and it achieves state-of-the-art performance among single-stage methods on three standard CIR benchmarks while remaining efficient.
📝 Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image from a query composed of a reference image and modification text. Recent training-free zero-shot CIR (ZS-CIR) methods often employ Multimodal Large Language Models (MLLMs) with Chain-of-Thought (CoT) prompting to compose a target image description for retrieval. However, due to the fuzzy matching nature of ZS-CIR, the generated description is prone to semantic bias relative to the target image. We propose SDR-CIR, a training-free Semantic Debias Ranking method based on CoT reasoning. First, Selective CoT guides the MLLM to extract only the visual content relevant to the modification text during image understanding, reducing visual noise at the source. We then introduce a Semantic Debias Ranking with two steps, Anchor and Debias, to mitigate semantic bias. In the Anchor step, we fuse reference image features with target description features to reinforce useful semantics and supplement omitted cues. In the Debias step, we explicitly model the visual semantic contribution of the reference image to the description and incorporate it into the similarity score as a penalty term. By supplementing omitted cues while suppressing redundancy, SDR-CIR mitigates semantic bias and improves retrieval performance. Experiments on three standard CIR benchmarks show that SDR-CIR achieves state-of-the-art results among one-stage methods while maintaining high efficiency. The code is publicly available at https://github.com/suny105/SDR-CIR.
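The Anchor and Debias steps described above can be sketched as a reranking score over embeddings. The sketch below assumes CLIP-style vector embeddings and illustrative hyperparameters `alpha` (fusion weight) and `beta` (penalty weight); the fusion form and penalty term are simplified assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sdr_score(ref_img, desc, cand, alpha=0.6, beta=0.3):
    """Illustrative Anchor + Debias score for one candidate image.

    ref_img: embedding of the reference image
    desc:    embedding of the MLLM-generated target description
    cand:    embedding of a candidate gallery image
    alpha, beta: assumed hyperparameters (not from the paper)
    """
    # Anchor: fuse description and reference-image features to
    # supplement cues the description may have omitted.
    anchor = alpha * desc + (1 - alpha) * ref_img
    anchor = anchor / np.linalg.norm(anchor)

    # Debias: estimate how much of the description merely restates
    # the reference image, and penalize candidates that match that
    # redundant portion rather than the intended modification.
    redundancy = cosine(ref_img, desc)
    return cosine(anchor, cand) - beta * redundancy * cosine(ref_img, cand)

# Toy usage: the candidate aligned with the modified semantics
# should outrank a near-duplicate of the reference image.
ref = np.array([1.0, 0.0])          # reference image embedding
desc = np.array([0.0, 1.0])         # description embedding (modification)
cand_target = np.array([0.3, 0.95]) # candidate matching the modification
cand_copy = np.array([1.0, 0.0])    # candidate identical to the reference
scores = {"target": sdr_score(ref, desc, cand_target),
          "copy": sdr_score(ref, desc, cand_copy)}
```

Ranking candidates by this score keeps the description as the dominant signal while the penalty term suppresses matches driven only by leftover reference-image content.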