OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

📅 2026-03-07
🤖 AI Summary
This work addresses real-time open-vocabulary object detection, where a detector must keep latency low while recognizing a large and extensible set of categories in dynamic environments. It presents OV-DEIM, an end-to-end DETR-style open-vocabulary detector built on the DEIMv2 framework with integrated vision-language modeling, together with a query supplement strategy that improves Fixed AP without compromising inference speed. It also proposes GridSynthetic, a data augmentation strategy that composes multiple training samples into structured image grids and improves semantic discrimination, particularly for rare classes. Experiments show state-of-the-art performance on open-vocabulary detection benchmarks, with both high accuracy and computational efficiency.

📝 Abstract
Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models. In contrast, real-time DETR-based methods still lag behind in terms of inference latency, model lightweightness, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework with integrated vision-language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we introduce GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories. Code and pretrained models are available at https://github.com/wleilei/OV-DEIM.
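
The grid composition idea behind GridSynthetic can be illustrated with a minimal sketch. The abstract only states that multiple training samples are composed into structured image grids, so everything below is an assumption made for illustration: the function names (`resize_nearest`, `grid_compose`), the fixed cell size, nearest-neighbour resizing, and the nested-list image format are hypothetical and are not taken from the OV-DEIM code base. The sketch places each sample in one grid cell and remaps its boxes into the coordinates of the composed image.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]                       # H x W grayscale image as nested lists
Box = Tuple[float, float, float, float, str]  # (x1, y1, x2, y2, label)
Sample = Tuple[Image, Sequence[Box]]


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize (an assumed choice, not the paper's)."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def grid_compose(
    samples: Sequence[Sample], rows: int, cols: int, cell_h: int, cell_w: int
) -> Tuple[Image, List[Box]]:
    """Tile rows * cols samples into one image and remap their boxes."""
    if len(samples) != rows * cols:
        raise ValueError("need exactly rows * cols samples")

    canvas: Image = [[0] * (cols * cell_w) for _ in range(rows * cell_h)]
    out_boxes: List[Box] = []

    for idx, (img, boxes) in enumerate(samples):
        r, c = divmod(idx, cols)              # row-major cell position
        off_y, off_x = r * cell_h, c * cell_w
        sy = cell_h / len(img)                # vertical scale factor
        sx = cell_w / len(img[0])             # horizontal scale factor

        # Paste the resized sample into its grid cell.
        tile = resize_nearest(img, cell_h, cell_w)
        for y in range(cell_h):
            canvas[off_y + y][off_x:off_x + cell_w] = tile[y]

        # Scale each box to the cell size, then shift it by the cell offset.
        for x1, y1, x2, y2, label in boxes:
            out_boxes.append(
                (x1 * sx + off_x, y1 * sy + off_y,
                 x2 * sx + off_x, y2 * sy + off_y, label)
            )
    return canvas, out_boxes


# Toy usage: four constant 2x2 images, one full-image box each, in a 2x2 grid.
demo_samples: List[Sample] = [
    ([[v, v], [v, v]], [(0.0, 0.0, 2.0, 2.0, name)])
    for v, name in [(1, "cat"), (2, "dog"), (3, "bird"), (4, "fish")]
]
canvas, boxes = grid_compose(demo_samples, rows=2, cols=2, cell_h=4, cell_w=4)
print(len(canvas), len(canvas[0]), boxes[1])
```

In this toy setup a single composed image carries the objects and labels of four source samples, which is one way to read the abstract's point about exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass. How the actual method chooses grid sizes, samples, or resizing is not specified in the abstract.
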
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary object detection
real-time detection
DETR
inference latency
rare categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

DETR-style
open-vocabulary object detection
GridSynthetic augmentation
vision-language modeling
real-time detection
Authors

Leilei Wang
Intellindust AI Lab; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; College of Computer Science and Software Engineering, Shenzhen University, China

Longfei Liu
Intellindust AI Lab

Xi Shen
Chief Scientist, Intellindust
Deep Learning, Computer Vision

Xuanlong Yu
Paris-Saclay University & ENSTA Paris, France
Computer Vision, Deep Learning, Uncertainty Estimation

Ying Tiffany He
College of Computer Science and Software Engineering, Shenzhen University, China

Fei Richard Yu
College of Computer Science and Software Engineering, Shenzhen University, China; School of Information Technology, Carleton University, Canada

Yingyi Chen
IRB Bellinzona, Switzerland; KU Leuven, Belgium
Machine Learning, Deep Learning