Referring Expression Instance Retrieval and A Strong End-to-End Baseline

📅 2025-06-22
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing text-to-image retrieval (TIR) lacks instance-level localization, while referring expression comprehension (REC) struggles to scale to large image corpora. To bridge this gap, we propose a new task—referring expression instance retrieval (REIR)—enabling fine-grained, text-driven instance retrieval with precise localization across massive image collections. We introduce REIRCOCO, the first REIR benchmark supporting joint image-level and instance-level evaluation. Methodologically, we design CLARE, a unified dual-stream framework incorporating a Mix of Relation Experts (MORE) module, and jointly optimize it via object detection, REC pretraining, and contrastive language–instance alignment (CLIA). Experiments demonstrate that CLARE achieves state-of-the-art performance on REIR while also generalizing strongly to both TIR and REC—effectively reconciling accuracy and scalability.

📝 Abstract
Natural language querying of visual content underpins many vision-language tasks, typically categorized by text granularity and visual search scope. Text-Image Retrieval (TIR) retrieves whole images using coarse descriptions, while Referring Expression Comprehension (REC) localizes objects using fine-grained expressions within a single image. However, real-world scenarios often require both instance-level retrieval and localization across large galleries -- tasks where TIR lacks precision and REC lacks scalability. To address this gap, we propose a new task: Referring Expression Instance Retrieval (REIR), which jointly supports instance-level retrieval and localization. We introduce REIRCOCO, a large-scale benchmark constructed by prompting vision-language models to generate fine-grained expressions for MSCOCO and RefCOCO instances. We also present a baseline method, CLARE, featuring a dual-stream architecture with a Mix of Relation Experts (MORE) module for capturing inter-instance relationships. CLARE integrates object detection and REC pretraining with Contrastive Language-Instance Alignment (CLIA) for end-to-end optimization. Experiments show that CLARE achieves state-of-the-art performance on REIR and generalizes well to TIR and REC, highlighting its effectiveness and versatility.
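The abstract's Contrastive Language-Instance Alignment (CLIA) pairs expression embeddings with instance embeddings so that matched pairs score higher than all others in the batch. A minimal NumPy sketch of a symmetric InfoNCE-style objective of this kind; the function name, shapes, and temperature are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def clia_loss(text_emb, inst_emb, temperature=0.07):
    """Symmetric contrastive loss between expression and instance embeddings.

    Hypothetical sketch: rows of text_emb and inst_emb are assumed to be
    matched pairs (positives); all other rows in the batch serve as negatives.
    """
    # L2-normalize so similarities are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = inst_emb / np.linalg.norm(inst_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature      # (N, N); matched pairs on the diagonal
    idx = np.arange(len(t))

    def xent(l):
        # cross-entropy with the diagonal as the target class, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()

    # average text-to-instance and instance-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned embeddings the diagonal dominates and the loss approaches zero; with random embeddings it stays near log N, which is the usual sanity check for this family of losses.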
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between instance-level retrieval and localization
Scaling fine-grained vision-language querying to large galleries
Integrating detection and comprehension for end-to-end optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces REIR for instance retrieval and localization
Develops REIRCOCO benchmark with vision-language models
Proposes CLARE with MORE module for end-to-end optimization