🤖 AI Summary
Modern YOLO architectures face persistent trade-offs among detection accuracy, inference speed, and deployment efficiency—particularly for small objects and edge deployment.
Method: This work systematically analyzes the architectural evolution of Ultralytics YOLO (YOLOv5, YOLOv8, YOLO11, YOLO26), tracing key innovations across releases: DFL removal for simplified, improved localization; native NMS-free inference for lower-latency, end-to-end prediction; ProgLoss for dynamic loss balancing; STAL label assignment and the MuSGD optimizer for enhanced training stability; decoupled detection heads with anchor-free prediction; and hybrid task assignment to boost small-object detection.
Contribution/Results: Benchmarking shows favorable AP–FPS trade-offs on MS COCO across the family; the review further covers quantization and cross-scenario deployment, positioning YOLO26 as a theoretically grounded, industrially viable direction for next-generation lightweight YOLO models.
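For context on the NMS-free design mentioned above, the sketch below shows the classic greedy non-maximum suppression step that earlier YOLO versions run after the network and that YOLO26 eliminates. This is a plain NumPy illustration, not Ultralytics code; the IoU threshold is an assumed typical value.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area(box[None]) + area(boxes) - inter
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring remaining box
    and discard boxes that overlap it above iou_thresh."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

# Two heavily overlapping detections of one object, plus a separate object.
boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the duplicate (index 1) is suppressed
```

A model with native NMS-free inference emits one box per object directly, removing this sequential, threshold-sensitive post-processing step from the deployment pipeline.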
📝 Abstract
This paper presents a comprehensive overview of the Ultralytics YOLO (You Only Look Once) family of object detectors, focusing on architectural evolution, benchmarking, deployment perspectives, and future challenges. The review begins with the most recent release, YOLO26 (YOLOv26), which introduces key innovations including Distribution Focal Loss (DFL) removal, native NMS-free inference, Progressive Loss Balancing (ProgLoss), Small-Target-Aware Label Assignment (STAL), and the MuSGD optimizer for stable training. The progression is then traced through YOLO11, with its hybrid task assignment and efficiency-focused modules; YOLOv8, which advanced with a decoupled detection head and anchor-free predictions; and YOLOv5, which established the modular PyTorch foundation that enabled modern YOLO development. Benchmarking on the MS COCO dataset provides a detailed quantitative comparison of YOLOv5, YOLOv8, YOLO11, and YOLO26, alongside cross-comparisons with YOLOv12, YOLOv13, RT-DETR, and DEIM. Metrics including precision, recall, F1 score, mean Average Precision, and inference speed are analyzed to highlight trade-offs between accuracy and efficiency. Deployment and application perspectives are further discussed, covering export formats, quantization strategies, and real-world use in robotics, agriculture, surveillance, and manufacturing. Finally, the paper identifies challenges and future directions, including dense-scene limitations, hybrid CNN-Transformer integration, open-vocabulary detection, and edge-aware training approaches.
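The per-class accuracy metrics compared in the benchmark reduce to a few standard formulas over IoU-matched detection counts. A minimal sketch follows; the counts are illustrative assumptions, not results from the paper.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative detection counts (detections are typically
    matched to ground truth at a fixed IoU threshold, e.g. 0.5)."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # fraction of detections that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of objects that are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of the two
    return precision, recall, f1

# Illustrative counts only -- not figures from the paper.
p, r, f1 = detection_metrics(tp=80, fp=20, fn=20)
print(p, r, f1)
```

Mean Average Precision extends this by sweeping the confidence threshold to trace a precision-recall curve per class, averaging the area under it across classes (and, for COCO-style mAP, across IoU thresholds from 0.5 to 0.95).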