WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

📅 2025-12-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary object detectors that rely on cross-modal fusion layers suffer from slow inference and limited generalization across tasks. This paper introduces the WeDetect model family, which fully adopts a retrieval-based formulation: detection is cast as cross-modal matching between image regions and text prompts in a shared embedding space, with no fusion layers. The family has three members: (1) WeDetect, a real-time dual-tower detector; (2) WeDetect-Uni, a universal proposal generator that freezes the detector and fine-tunes only an objectness prompt, yielding class-specific proposal embeddings that enable object retrieval over historical data; and (3) WeDetect-Ref, an LMM-based object classifier for referring expression comprehension (REC) that discards next-token prediction and classifies the proposals from WeDetect-Uni in a single forward pass. Across 15 benchmarks, the WeDetect family achieves state-of-the-art performance with high inference efficiency.

📝 Abstract
Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtracking of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, which supports retrieving objects from historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
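
To make the "detection as retrieval" idea concrete, here is a minimal sketch of matching region embeddings to text-prompt embeddings in a shared space. It is not the WeDetect implementation: the function names, the use of cosine similarity, the score threshold, and the toy embeddings are all assumptions made for illustration. In a real system the two sets of embeddings would come from the image and text towers.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_detections(region_emb, text_emb, boxes, score_thresh=0.3):
    """Assign each region its best-matching text prompt (hypothetical helper).

    region_emb: (R, D) embeddings of candidate regions (assumed image-tower output)
    text_emb:   (C, D) embeddings of class prompts (assumed text-tower output)
    boxes:      (R, 4) box coordinates for the regions
    """
    sims = l2_normalize(region_emb) @ l2_normalize(text_emb).T  # (R, C)
    labels = sims.argmax(axis=1)      # index of the best prompt per region
    scores = sims.max(axis=1)         # similarity to that prompt
    keep = scores >= score_thresh     # drop regions that match no prompt well
    return boxes[keep], labels[keep], scores[keep]


# Toy data: three regions, two prompts, 3-dimensional embeddings.
text_emb = np.array([[1.0, 0.0, 0.0],    # prompt 0
                     [0.0, 1.0, 0.0]])   # prompt 1
region_emb = np.array([[0.9, 0.1, 0.0],  # close to prompt 0
                       [0.0, 2.0, 0.0],  # same direction as prompt 1
                       [0.0, 0.0, 1.0]]) # orthogonal to both prompts
boxes = np.array([[0, 0, 10, 10],
                  [5, 5, 20, 20],
                  [1, 1, 4, 4]])

kept_boxes, kept_labels, kept_scores = retrieve_detections(region_emb, text_emb, boxes)
```

Because the text embeddings do not depend on the image, they can be computed once per vocabulary and reused for every frame, which is one reason a non-fusion design can be fast.
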
Problem

Research questions and friction points this paper is trying to address.

Detects arbitrary objects using text prompts without cross-modal fusion layers
Enables fast retrieval of objects from historical data using universal proposals
Integrates with language models to comprehend complex referring expressions efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-fusion dual-tower architecture for real-time detection
Universal proposal generator enabling fast object retrieval
LMM-based classifier for complex referring expression comprehension
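
The object-retrieval use case can be sketched as a search over cached proposal embeddings: proposals are embedded once when an image is processed, and later text queries are answered without re-running the detector. The class below is a brute-force illustration under assumed names (`ProposalIndex`, `add`, `search`) and toy embeddings. The source does not describe how WeDetect-Uni stores or searches its embeddings.

```python
import numpy as np


class ProposalIndex:
    """Brute-force cosine-similarity index over proposal embeddings (hypothetical)."""

    def __init__(self, dim):
        self.dim = dim
        self._emb = np.empty((0, dim), dtype=np.float32)
        self._meta = []  # one (image_id, box) entry per stored embedding

    def add(self, image_id, boxes, embeddings):
        """Store unit-normalized embeddings for the proposals of one image."""
        emb = np.asarray(embeddings, dtype=np.float32).reshape(-1, self.dim)
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        emb = emb / np.maximum(norms, 1e-12)
        self._emb = np.vstack([self._emb, emb])
        self._meta.extend((image_id, tuple(b)) for b in boxes)

    def search(self, query, top_k=5):
        """Return the top_k stored proposals most similar to a query embedding."""
        q = np.asarray(query, dtype=np.float32)
        q = q / max(float(np.linalg.norm(q)), 1e-12)
        scores = self._emb @ q
        order = np.argsort(-scores)[:top_k]  # highest similarity first
        return [(self._meta[i][0], self._meta[i][1], float(scores[i])) for i in order]


# Toy usage: index two frames, then query with an assumed text embedding.
index = ProposalIndex(dim=3)
index.add("frame_001", [(0, 0, 10, 10), (5, 5, 20, 20)], [[1, 0, 0], [0, 1, 0]])
index.add("frame_002", [(1, 1, 4, 4)], [[0.8, 0.6, 0.0]])
hits = index.search([1, 0, 0], top_k=2)
```

For large archives, the exhaustive matrix product would typically be replaced by an approximate nearest-neighbor index. That is a design choice outside what the source describes.
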