AI Summary
Existing visual search methods decouple sentence-level cross-image retrieval from pixel-level localization: text-to-image retrieval lacks fine-grained grounding capability, while referring expression localization assumes target presence, a strong prior that leads to high false-positive rates in large-scale settings. This paper introduces Referring Search and Discovery (ReSeDis), a novel task that unifies cross-image existence verification and pixel-level target localization (via bounding boxes or masks) for natural language queries over large image corpora. To support this task, we construct the first large-scale, ambiguity-resolved ReSeDis benchmark and design a joint evaluation metric that balances retrieval recall and localization accuracy. We further propose a zero-shot baseline leveraging frozen multimodal foundation models (e.g., CLIP), integrating corpus-level retrieval with instance-level referring localization. Experiments reveal substantial headroom for improvement and establish a new paradigm for robust, scalable, end-to-end multimodal search.
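The retrieve-then-localize structure of such a baseline can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual method: `query_vec` and `image_vecs` stand in for embeddings from a frozen vision-language model (e.g., CLIP), `localize_fn` stands in for any off-the-shelf referring-localization module, and the similarity threshold `sim_thresh` is an assumed hyperparameter.

```python
import numpy as np

def retrieve_then_localize(query_vec, image_vecs, localize_fn, sim_thresh=0.3):
    """Hypothetical two-stage pipeline: (1) a cosine-similarity gate decides
    which corpus images may contain the queried object; (2) a grounding
    module runs only on the surviving images, returning a box per hit and
    None for images judged not to contain the object."""
    q = query_vec / np.linalg.norm(query_vec)
    results = {}
    for img_id, vec in image_vecs.items():
        sim = float(q @ (vec / np.linalg.norm(vec)))
        # Images below the gate are reported as "object absent" (None);
        # only gated-in images pay the cost of fine-grained localization.
        results[img_id] = localize_fn(img_id) if sim >= sim_thresh else None
    return results
```

The key design point the sketch makes concrete is that, unlike standard referring localization, absence (`None`) is a first-class output: the model must verify existence before grounding.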
Abstract
Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object's bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.
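To make the joint-scoring idea concrete, here is a minimal sketch of one way such a metric could combine retrieval and localization, assuming one box (or none) per image and per query; this is an illustrative construction, not the benchmark's actual metric, and the IoU threshold of 0.5 is an assumed value.

```python
def iou(box_a, box_b):
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def joint_score(preds, gts, iou_thresh=0.5):
    """Localization-aware precision/recall/F1 over a corpus.

    preds/gts map image_id -> box, or None when the object is judged
    (resp. known) to be absent. A prediction is a true positive only if
    the image truly contains the object AND the box overlaps enough."""
    tp = fp = fn = 0
    for img_id, gt_box in gts.items():
        pred_box = preds.get(img_id)
        if gt_box is None and pred_box is None:
            continue                      # correct rejection
        if gt_box is None:
            fp += 1                       # false alarm on a negative image
        elif pred_box is None:
            fn += 1                       # relevant image missed
        elif iou(pred_box, gt_box) >= iou_thresh:
            tp += 1                       # retrieved AND well localized
        else:
            fp += 1                       # retrieved but mislocalized:
            fn += 1                       # penalized on both axes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The sketch shows why neither retrieval metrics nor grounding metrics alone suffice: a model that retrieves every image maximizes recall but is punished for false alarms, while one that grounds precisely but misses relevant images is punished on recall.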