🤖 AI Summary
To address the accuracy ceiling of YOLO-series detectors, which have long relied on CNN-based designs despite the stronger modeling capability of attention, this paper proposes YOLOv12, an attention-centric real-time object detection framework that matches the speed of its CNN-based predecessors. Methodologically, it builds the network around an area attention mechanism that partitions the feature map into regions to cut the cost of self-attention, introduces residual efficient layer aggregation networks (R-ELAN) for stable optimization, and applies architectural refinements such as FlashAttention to reduce memory-access overhead. Key results: (1) YOLOv12 consistently outperforms YOLOv10/YOLOv11 and RT-DETR across model scales; (2) YOLOv12-N achieves 40.6% mAP at 1.64 ms latency on an NVIDIA T4 GPU; (3) YOLOv12-S runs 42% faster than RT-DETR-R18 while using only 36% of the FLOPs and 45% of the parameters, demonstrating gains in both speed and accuracy.
📝 Abstract
Enhancing the network architecture of the YOLO framework has long been crucial, but efforts have focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capability. This is because attention-based models cannot match the speed of CNN-based ones. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based models while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming the advanced YOLOv10-N / YOLOv11-N by 2.1% / 1.2% mAP at comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve on DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
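The core idea behind matching CNN speed while keeping attention is to restrict where attention is computed: if n tokens are split into l regions and attention runs only within each region, the quadratic cost drops from roughly n²·d to n²·d/l. Below is a minimal numpy sketch of such region-restricted self-attention (a toy illustration of the principle, not the paper's implementation; the function name and the simplification q = k = v are assumptions for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_attention(x, num_regions=4):
    """Toy region-restricted attention: split the n tokens into
    `num_regions` contiguous segments and run scaled dot-product
    attention independently inside each segment, so the cost scales
    as ~n^2/num_regions instead of n^2 for global attention.
    For simplicity q = k = v = x (no learned projections)."""
    n, d = x.shape
    assert n % num_regions == 0, "token count must divide evenly"
    seg = n // num_regions
    out = np.empty_like(x)
    for i in range(num_regions):
        a = x[i * seg:(i + 1) * seg]          # tokens in one region
        scores = a @ a.T / np.sqrt(d)         # (seg, seg) similarity
        out[i * seg:(i + 1) * seg] = softmax(scores) @ a
    return out

x = np.random.default_rng(0).standard_normal((16, 8))
y = region_attention(x, num_regions=4)
print(y.shape)  # (16, 8)
```

Because each token attends only within its own region, information never flows across region boundaries inside one such block; a real detector would interleave these blocks with convolutions or vary the partition so the receptive field still grows.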