MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities

📅 2025-11-25
🤖 AI Summary
Traditional medical image object detection relies on closed-set paradigms, limiting generalization to unseen pathological categories. To address this, we propose MedROV, the first open-vocabulary object detection (OVOD) framework for multimodal medical imaging. Our approach comprises three key components: (1) constructing Omnis, a large-scale, multimodal, fine-grained medical detection dataset; (2) designing a foundation-model-guided pseudo-labeling strategy that, together with contrastive learning and cross-modal representation alignment, detects both known and novel lesions; and (3) optimizing the inference architecture for real-time performance. Experiments demonstrate an average absolute gain of 40 mAP50 over the previous state-of-the-art foundation model for medical image detection and of more than 3 mAP50 over closed-set detectors, while maintaining a real-time inference speed of 70 FPS. This establishes a flexible, general-purpose detection foundation for dynamic clinical diagnosis.

📝 Abstract
Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
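The abstract mentions a pseudo-labeling strategy for handling missing annotations when merging multi-source datasets (each source annotates only its own label set). The paper does not publish this routine in the abstract; the following is a minimal illustrative sketch of the general idea, where a pre-trained "teacher" model's confident predictions for classes a source dataset does not annotate are kept as pseudo ground truth. The function name, array layout, and threshold are assumptions for illustration:

```python
import numpy as np

def pseudo_label_missing(boxes, scores, labels, known_classes, score_thresh=0.5):
    """Keep teacher-model predictions for classes that a given source
    dataset does not annotate, so they can serve as pseudo ground truth.

    boxes:         (N, 4) predicted boxes from a foundation "teacher" model
    scores:        (N,) confidence scores
    labels:        list of N predicted class names
    known_classes: set of class names the source dataset already annotates
    """
    keep = [
        i for i in range(len(labels))
        # only fill in classes the source dataset is missing,
        # and only when the teacher is confident enough
        if labels[i] not in known_classes and scores[i] >= score_thresh
    ]
    return boxes[keep], [labels[i] for i in keep]

boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20], [1, 1, 2, 2]])
scores = np.array([0.9, 0.4, 0.8])
labels = ["tumor", "nodule", "cyst"]
pseudo_boxes, pseudo_labels = pseudo_label_missing(
    boxes, scores, labels, known_classes={"tumor"}
)
# "tumor" is already annotated by the source; "nodule" is below threshold;
# only the confident "cyst" prediction becomes a pseudo label
```

Ground-truth annotations from the source dataset would then be merged with these pseudo labels before training.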
Problem

Research questions and friction points this paper is trying to address.

Detecting novel objects in medical imaging beyond predefined labels
Addressing dataset scarcity and weak text-image alignment in medical OVOD
Enabling real-time open-vocabulary detection across diverse medical modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages contrastive learning for open-vocabulary detection
Uses pseudo-labeling to handle missing annotations
Incorporates knowledge from pre-trained foundation models
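The contrastive, cross-modal classification idea listed above can be sketched as CLIP-style region-text matching: region features from the detector are compared against text embeddings of class-name prompts, so any class that can be named can be scored, including novel ones. This is a generic illustration of the technique, not MedROV's actual implementation; the function name and temperature value are assumptions:

```python
import numpy as np

def classify_regions(region_feats, text_feats, class_names, temperature=0.07):
    """Open-vocabulary classification of detected regions by cosine
    similarity to text embeddings of class prompts.

    region_feats: (R, D) region embeddings from the detector
    text_feats:   (C, D) text embeddings, one per class prompt
    """
    # L2-normalize so the dot product equals cosine similarity
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = r @ t.T / temperature  # (R, C) similarity logits
    preds = logits.argmax(axis=1)   # best-matching prompt per region
    return [class_names[i] for i in preds]

# toy 2-D embeddings: region 0 aligns with "cyst", region 1 with "tumor"
regions = np.array([[0.0, 1.0], [1.0, 0.0]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
names = classify_regions(regions, texts, ["tumor", "cyst"])
```

Because the class list enters only through the text embeddings, adding a novel class at inference time requires only encoding a new prompt, not retraining the detector.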