Real-Time Object Detection Meets DINOv3

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To balance performance and efficiency in real-time object detection across heterogeneous devices, from GPUs to mobile platforms, this paper proposes DEIMv2, a unified lightweight detection framework. DEIMv2 introduces the Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale features into multi-scale detection features, and pairs it with an HGNetv2 backbone under joint depth-and-width pruning for the ultra-lightweight variants, plus a simplified decoder and an enhanced Dense One-to-One (O2O) matching mechanism. Experiments show that DEIMv2-X (50.3M parameters) achieves 57.8 AP, surpassing larger models; DEIMv2-S (9.71M parameters) reaches 50.9 AP, the first sub-10M-parameter detector to exceed 50 AP on COCO; and DEIMv2-Pico (1.5M parameters) delivers 38.5 AP with superior energy efficiency. Together these results establish a new state of the art for ultra-lightweight real-time detection.

📝 Abstract
Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters.
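The abstract's central mechanism, converting a single-scale DINOv3 feature map into a multi-scale pyramid for detection, can be illustrated with a toy sketch. The snippet below is not the paper's actual Spatial Tuning Adapter (whose design is not reproduced here); it is a minimal NumPy analogy, with function names of our own choosing, showing how one feature map can yield a finer level via nearest-neighbor upsampling and a coarser level via average pooling.

```python
import numpy as np

def avg_pool2x(x):
    """2x2 average pooling on a (C, H, W) feature map (H, W assumed even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """2x nearest-neighbor upsampling on a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def build_pyramid(feat):
    """Toy stand-in for an STA-like adapter: one scale in, three scales out."""
    return {
        "P3": upsample2x(feat),  # finer level: recovers spatial detail
        "P4": feat,              # native single-scale ViT resolution
        "P5": avg_pool2x(feat),  # coarser level: stronger semantics
    }

# e.g. an 8-channel 16x16 patch grid from a ViT backbone
feat = np.random.rand(8, 16, 16)
pyr = build_pyramid(feat)
print({k: v.shape for k, v in pyr.items()})
# {'P3': (8, 32, 32), 'P4': (8, 16, 16), 'P5': (8, 8, 8)}
```

A real adapter would use learned convolutions rather than fixed pooling/upsampling, but the shape transformation, one scale in, a pyramid out, is the same.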
Problem

Research questions and friction points this paper is trying to address.

Extending DEIM with DINOv3 features for real-time detection
Achieving superior performance-cost trade-off across diverse deployments
Establishing new state-of-the-art results with fewer parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates DINOv3 features with a Spatial Tuning Adapter
Uses HGNetv2 with pruning for ultra-lightweight models
Employs a simplified decoder and upgraded Dense O2O
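The one-to-one (O2O) matching mentioned in the last bullet assigns each ground-truth box to at most one prediction, so the detector needs no NMS at inference. As a rough conceptual illustration only, DETR-style detectors typically solve this with Hungarian matching over a combined classification/box cost, the pure-Python sketch below uses a greedy IoU-based assignment; all names here are ours, not the paper's.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def greedy_o2o(preds, gts):
    """Greedy one-to-one assignment: each GT claims at most one prediction.

    Returns {gt_index: pred_index}. A simplification of the optimal
    Hungarian assignment used by DETR-style matchers.
    """
    pairs = sorted(
        ((iou(p, g), pi, gi)
         for pi, p in enumerate(preds)
         for gi, g in enumerate(gts)),
        reverse=True,
    )
    used_p, used_g, match = set(), set(), {}
    for score, pi, gi in pairs:
        if score > 0 and pi not in used_p and gi not in used_g:
            match[gi] = pi
            used_p.add(pi)
            used_g.add(gi)
    return match

preds = [(0, 0, 10, 10), (9, 9, 20, 20), (50, 50, 60, 60)]
gts = [(1, 1, 11, 11), (48, 48, 58, 58)]
print(greedy_o2o(preds, gts))  # {0: 0, 1: 2}
```

The "Dense" part of the paper's Dense O2O refers to increasing the number of positive targets during training; the one-to-one property itself is what the assignment above illustrates.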
Shihua Huang
Intellindust AI Lab

Yongjie Hou
Intellindust AI Lab; Xiamen University

Longfei Liu
Intellindust AI Lab

Xuanlong Yu
Paris-Saclay University & ENSTA Paris, France
Computer Vision, Deep Learning, Uncertainty Estimation

Xi Shen
Chief Scientist, Intellindust
Deep Learning, Computer Vision