A Sanity Check on Composed Image Retrieval

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing composed image retrieval (CIR) benchmarks contain ambiguous queries and do not model multi-turn interaction, so they do not faithfully reflect model performance. This work introduces FISD, a benchmark that uses generative models to construct semantically diverse, tightly controlled reference-target image pairs, which removes query ambiguity. It also proposes an automated multi-turn, agent-based evaluation framework that simulates user-system interaction. Evaluating typical CIR methods on FISD across six dimensions, and over successive query rounds in the agentic framework, gives a more accurate and more realistic picture of how these models perform in practical applications.
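To make the multi-turn idea concrete, here is a toy sketch of an interactive evaluation loop. It is not the paper's framework: the attribute-set gallery, the overlap-based `retrieve` function, the rule-based `feedback` agent, and the `rounds_to_target` metric are all hypothetical stand-ins chosen so the example runs without any model. It only illustrates the general pattern of retrieving, comparing against the target, issuing a refinement, and counting rounds.

```python
def retrieve(gallery, wanted):
    """Toy retriever: rank image ids by how many wanted attributes each image has."""
    # sorted() is stable, so ties keep the gallery's insertion order.
    return sorted(gallery, key=lambda image_id: -len(gallery[image_id] & wanted))


def feedback(gallery, target_id, shown_id):
    """Toy user agent: name one attribute the target has but the shown image lacks."""
    missing = sorted(gallery[target_id] - gallery[shown_id])
    return missing[0] if missing else None


def rounds_to_target(gallery, target_id, first_hint, max_rounds=5):
    """Return the round in which the target is ranked first, or None if never."""
    wanted = {first_hint}
    for round_no in range(1, max_rounds + 1):
        top = retrieve(gallery, wanted)[0]
        if top == target_id:
            return round_no
        hint = feedback(gallery, target_id, top)
        if hint is None:
            return None  # the toy agent has nothing left to ask for
        wanted.add(hint)  # refine the query for the next round
    return None


# Hypothetical gallery: each image is described by a set of attributes.
gallery = {
    "a": {"dog", "grass"},
    "b": {"dog", "beach"},
    "c": {"dog", "beach", "sunset"},
}
print(rounds_to_target(gallery, "c", "dog"))
```

In a real setting the retriever would be a CIR model and the feedback would be a generated relative caption; the loop structure is the only part this sketch is meant to convey.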

📝 Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries that degrade the evaluation (i.e., multiple candidate images, rather than only the target image, satisfy the query), and which do not consider the models' effectiveness in multi-round systems. Motivated by this, we improve the evaluation procedure in two respects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automatic multi-round agentic evaluation framework to probe the potential of existing models in interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons demonstrate the value of this evaluation on typical CIR methods.
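The effect of indeterminate queries on a standard metric can be shown with a small sketch. It assumes Recall@K as the metric, which the abstract does not name, and the query, image ids, and ranking below are invented for illustration. It contrasts scoring against the single annotated target with scoring against every image that actually satisfies the query.

```python
def recall_at_k(ranked_ids, target_id, k):
    """Return 1.0 if the annotated target appears in the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def recall_at_k_any_valid(ranked_ids, valid_ids, k):
    """Return 1.0 if any image that satisfies the query appears in the top-k results."""
    return 1.0 if any(i in valid_ids for i in ranked_ids[:k]) else 0.0


# Hypothetical query: "same dress, but in red". Images 7 and 12 both satisfy it,
# but only image 12 was annotated as the target.
ranked = [7, 3, 12, 5]      # a model's ranking, best match first
annotated_target = 12
all_valid = {7, 12}

strict = recall_at_k(ranked, annotated_target, k=1)
lenient = recall_at_k_any_valid(ranked, all_valid, k=1)
print(strict, lenient)
```

Here the model's top result is a correct answer to the query, yet the strict metric scores it as a miss. A benchmark in which each query has exactly one valid image makes the two scores coincide.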

Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval

evaluation benchmark

query ambiguity

multi-round interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Composed Image Retrieval

FISD benchmark

generative models

multi-round evaluation