ZERO: Multi-modal Prompt-based Visual Grounding

📅 2025-07-06

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

To address the challenge of robust object detection under few-shot, multimodal conditions in industrial settings, this paper proposes a multimodal prompt-driven zero-shot detection framework. Methodologically, we design a dedicated vision-language dual-modality encoder that jointly processes textual and visual prompts; further, we introduce prompt diversity optimization and a conservative pseudo-labeling strategy to enhance zero-shot generalization. The model is pretrained on a billion-scale image corpus, comprising 620 million parameters and requiring 1.033 TFLOPS computational throughput. On the RF20VL-fsod benchmark, it significantly outperforms existing methods, demonstrating effectiveness and industrial applicability for cross-domain deployment with minimal annotation cost. Our core contribution is the first multimodal prompt-fusion architecture tailored for industrial zero-shot detection—uniquely balancing scalability, adaptability, and practical deployability.

Technology Category

Application Category

📝 Abstract

Recent advances in artificial intelligence have led to the emergence of foundation models, large-scale pre-trained neural networks that serve as versatile starting points for a wide range of downstream tasks. In this work, we present ZERO, a zero-shot multi-prompt object detection model specifically designed for robust, production-ready deployment across diverse industrial domains. ZERO integrates direct image input with multiple user-defined prompts, which can include both textual and visual cues, and processes them through dedicated encoders to generate accurate detection outputs. The model architecture is optimized for scalability, with a total of 1.033 TFLOPS and 622.346 million parameters, and is trained using a domain-specific image database exceeding one billion images. For the CVPR 2025 Foundational Few-Shot Object Detection (FSOD) Challenge, we introduce a domain-specific fine-tuning strategy that emphasizes prompt diversity and conservative pseudo-labeling, enabling effective adaptation to new domains with minimal supervision. Our approach demonstrates practical advantages in flexibility, efficiency, and real-world applicability, achieving strong performance on the RF20VL-fsod benchmark despite limited annotation budgets. The results highlight the potential of prompt-driven, data-centric AI for scalable and adaptive object detection in dynamic industrial environments.

Problem

Research questions and friction points this paper is trying to address.

Zero-shot multi-prompt object detection for diverse industries

Integration of image and multi-modal prompts for accurate detection

Domain adaptation with minimal supervision using prompt diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-prompt object detection with text and visual cues

Scalable architecture with 1.033 TFLOPS and 622M parameters

Domain-specific fine-tuning with prompt diversity strategy

🔎 Similar Papers

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision