🤖 AI Summary
Research on composed image retrieval (CIR) is hindered by the lack of high-quality, instance-level training data and evaluation benchmarks. To address this, we consider an instance-level CIR paradigm: given a visual query depicting a specific object instance and a textual query describing semantic modifications, the task is to retrieve images containing *that exact instance* after the modification. We introduce i-CIR, a compact, high-difficulty benchmark explicitly designed for instance-level evaluation, kept challenging through a semi-automated selection of hard negatives. Furthermore, we propose BASIC, a training-free retrieval method built upon pre-trained vision-language models (VLMs). BASIC separately estimates query-image-to-image and query-text-to-image similarities and performs late fusion that upweights images satisfying both queries while down-weighting those that match only one; each individual similarity is further refined by simple, intuitive components. Experiments demonstrate that BASIC achieves state-of-the-art performance on i-CIR as well as on established CIR benchmarks that follow a semantic-level class definition.
📝 Abstract
The progress of composed image retrieval (CIR), a popular image-retrieval task in which a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining a level of difficulty comparable to retrieval among more than 40M random distractors, achieved through a semi-automated selection of hard negatives.
To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of simple and intuitive components. BASIC sets a new state of the art not only on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
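The late-fusion idea described above can be sketched as follows. This is a minimal illustrative example, not the paper's exact formulation: the min-max normalization, the geometric-mean fusion rule, and the `alpha` weight are all assumptions chosen to reproduce the stated behavior (reward images scoring high on both the visual and textual query, penalize those scoring high on only one).

```python
import numpy as np

def late_fusion_scores(sim_image, sim_text, alpha=0.5):
    """Fuse per-gallery-image similarities from the two query branches.

    sim_image, sim_text: 1-D arrays, one similarity per gallery image
    (e.g. cosine similarities from a VLM). A weighted geometric mean of
    normalized scores is near zero whenever either branch is near zero,
    so only images matching BOTH queries rank highly. Hypothetical
    sketch; the actual BASIC components are more elaborate.
    """
    def minmax(s):
        lo, hi = s.min(), s.max()
        return (s - lo) / (hi - lo + 1e-8)  # avoid division by zero
    si, st = minmax(np.asarray(sim_image)), minmax(np.asarray(sim_text))
    # product rule: a low score on either branch pulls the fused score down
    return si ** alpha * st ** (1.0 - alpha)

# toy example: image 0 matches both queries, images 1 and 2 only one each
sim_image = np.array([0.90, 0.95, 0.10])
sim_text = np.array([0.80, 0.05, 0.70])
ranking = np.argsort(-late_fusion_scores(sim_image, sim_text))
print(ranking[0])  # image 0 ranks first
```

A multiplicative (geometric-mean) rule is one common way to encode an AND-like constraint between two score sources; an additive rule would instead let a single very high branch score dominate, which is exactly the failure mode the description rules out.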