A Sanity Check on Composed Image Retrieval

📅 2026-04-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing composed image retrieval (CIR) benchmarks contain ambiguous queries, where several candidate images satisfy the same query, and they do not model multi-round interactive use, so they do not faithfully reflect model performance. To address these limitations, this work introduces FISD, a Fully-Informed Semantically-Diverse benchmark that uses generative models to control the variables of reference-target image pairs and thereby remove query ambiguity. It also proposes an automatic multi-round agentic evaluation framework that observes how models adapt and refine their choices over successive rounds of queries. Typical CIR methods are assessed across six dimensions, and the authors report that the new evaluation gives a more accurate and more realistic picture of model efficacy in practical applications.

📝 Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries that degrade the evaluation (i.e., multiple candidate images, rather than solely the target image, meet the query criteria) and which do not consider model effectiveness in multi-round systems. Motivated by this, we improve the evaluation procedure in two respects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automatic multi-round agentic evaluation framework to probe the potential of existing models in interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons demonstrate the value of this evaluation on typical CIR methods.
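
The effect of an indeterminate query on a retrieval score can be illustrated with a small sketch. It assumes Recall@K as the metric, which is common in retrieval evaluation but is not named in the abstract, and all query IDs, image IDs, and the `recall_at_k` helper are hypothetical and not taken from FISD.

```python
from typing import Dict, List, Set


def recall_at_k(
    rankings: Dict[str, List[str]],
    targets: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one accepted image in the top-k results."""
    hits = sum(
        1
        for query_id, ranked in rankings.items()
        if targets[query_id] & set(ranked[:k])
    )
    return hits / len(rankings)


# Hypothetical toy data: for query "q1" both img_a and img_b satisfy the
# relative caption, but only img_a is annotated as the target.
rankings = {
    "q1": ["img_b", "img_a", "img_c"],
    "q2": ["img_d", "img_e", "img_f"],
}
single_target = {"q1": {"img_a"}, "q2": {"img_d"}}
all_valid = {"q1": {"img_a", "img_b"}, "q2": {"img_d"}}

# Scoring against the single annotated target counts q1 as a miss at k=1,
# even though the top-ranked image is a valid answer.
strict = recall_at_k(rankings, single_target, k=1)
lenient = recall_at_k(rankings, all_valid, k=1)
print(f"single-target R@1: {strict:.2f}, all-valid R@1: {lenient:.2f}")
```

In this toy case the same ranking scores differently depending only on how the ground truth is annotated, which is the kind of distortion an ambiguity-free benchmark is meant to avoid.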
Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval
evaluation benchmark
query ambiguity
multi-round interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Composed Image Retrieval
FISD benchmark
generative models
multi-round evaluation
query disambiguation