AI Summary
Existing visual search methods decouple sentence-level cross-image retrieval from pixel-level localization: text-to-image retrieval lacks fine-grained grounding capability, while referring expression localization assumes target presence, a strong prior that leads to high false-positive rates in large-scale settings. This paper introduces Referring Search and Discovery (ReSeDis), a novel task that unifies cross-image existence verification and pixel-level target localization (via bounding boxes or masks) for natural language queries over large image corpora. To support this task, we construct the first large-scale, ambiguity-resolved ReSeDis benchmark and design a joint evaluation metric that balances retrieval recall and localization accuracy. We further propose a zero-shot baseline leveraging frozen multimodal foundation models (e.g., CLIP), integrating corpus-level retrieval with instance-level referring localization. Experiments reveal substantial headroom for improvement and establish a new paradigm for robust, scalable, end-to-end multimodal search.
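The retrieve-then-localize structure of such a baseline can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual method: `query_vec` and `image_vecs` stand in for embeddings from a frozen vision-language model (e.g., CLIP), `localize_fn` stands in for any off-the-shelf referring-localization module, and the similarity threshold `sim_thresh` is an assumed hyperparameter.

```python
import numpy as np

def retrieve_then_localize(query_vec, image_vecs, localize_fn, sim_thresh=0.3):
    """Hypothetical two-stage pipeline: (1) a cosine-similarity gate decides
    which corpus images may contain the queried object; (2) a grounding
    module runs only on the surviving images, returning a box per hit and
    None for images judged not to contain the object."""
    q = query_vec / np.linalg.norm(query_vec)
    results = {}
    for img_id, vec in image_vecs.items():
        sim = float(q @ (vec / np.linalg.norm(vec)))
        # Images below the gate are reported as "object absent" (None);
        # only gated-in images pay the cost of fine-grained localization.
        results[img_id] = localize_fn(img_id) if sim >= sim_thresh else None
    return results
```

The key design point the sketch makes concrete is that, unlike standard referring localization, absence (`None`) is a first-class output: the model must verify existence before grounding.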
Abstract
Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object's bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.
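To make the joint-scoring idea concrete, here is a minimal sketch of one way such a metric could combine retrieval and localization, assuming one box (or none) per image and per query; this is an illustrative construction, not the benchmark's actual metric, and the IoU threshold of 0.5 is an assumed value.

```python
def iou(box_a, box_b):
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def joint_score(preds, gts, iou_thresh=0.5):
    """Localization-aware precision/recall/F1 over a corpus.

    preds/gts map image_id -> box, or None when the object is judged
    (resp. known) to be absent. A prediction is a true positive only if
    the image truly contains the object AND the box overlaps enough."""
    tp = fp = fn = 0
    for img_id, gt_box in gts.items():
        pred_box = preds.get(img_id)
        if gt_box is None and pred_box is None:
            continue                      # correct rejection
        if gt_box is None:
            fp += 1                       # false alarm on a negative image
        elif pred_box is None:
            fn += 1                       # relevant image missed
        elif iou(pred_box, gt_box) >= iou_thresh:
            tp += 1                       # retrieved AND well localized
        else:
            fp += 1                       # retrieved but mislocalized:
            fn += 1                       # penalized on both axes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The sketch shows why neither retrieval metrics nor grounding metrics alone suffice: a model that retrieves every image maximizes recall but is punished for false alarms, while one that grounds precisely but misses relevant images is punished on recall.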