🤖 AI Summary
To address the real-time and accuracy bottlenecks in detecting small objects within 4K omnidirectional images, which stem from severe geometric distortion, ultra-high resolution, and a wide field of view, this paper proposes a lightweight, efficient detection framework, YOLO11-4K. Methodologically, it adds a P2 multi-scale detection head to improve sensitivity to small objects, adopts a GhostConv-based backbone to substantially reduce parameter count while preserving representational capacity, and integrates these components into a YOLO11-based architecture with an optimized inference pipeline. The authors also manually annotate the CVIP360 dataset with 6,876 frame-level bounding boxes, releasing it as a publicly available, detection-ready benchmark for 4K omnidirectional object detection. Experiments show that the method achieves 0.95 mAP@0.5 IoU on 4K panoramic images with a single-frame inference latency of only 28.3 ms, a 75% latency reduction and a 4.2-percentage-point mAP gain over YOLO11, delivering both state-of-the-art accuracy and practical deployability.
📝 Abstract
The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard input sizes (for example, 640×640 pixels) and often struggle with the computational demands of the 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with a per-frame inference time of 28.3 milliseconds, a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications. While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
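To give a sense of why a GhostConv-based backbone reduces computational complexity, the following back-of-the-envelope sketch compares parameter counts for a standard convolution and a Ghost module. It assumes the common GhostNet formulation (ghost ratio `s = 2`, 3×3 cheap depthwise ops); the paper's exact GhostConv configuration may differ, so the specific numbers are illustrative only.

```python
def conv_params(c_in, c_out, k):
    # Standard convolution: one k x k kernel per (input, output) channel pair.
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k, s=2, d=3):
    # Ghost module: a primary convolution produces c_out // s "intrinsic"
    # feature maps; cheap d x d depthwise ops then generate the remaining
    # (s - 1) "ghost" maps per intrinsic map. s and d here follow the
    # GhostNet defaults and are assumptions, not values from this paper.
    intrinsic = c_out // s
    primary = c_in * intrinsic * k * k      # dense primary convolution
    cheap = intrinsic * (s - 1) * d * d     # depthwise "cheap" operations
    return primary + cheap

# Example: a 256 -> 256 channel, 3x3 backbone layer.
std = conv_params(256, 256, 3)    # 589,824 parameters
gho = ghost_params(256, 256, 3)   # 296,064 parameters
print(std, gho, round(std / gho, 2))  # compression ratio close to s = 2
```

Because the dense primary convolution dominates and only produces `1/s` of the output channels, the compression approaches the ghost ratio `s` for large channel counts, which is the mechanism behind "fewer parameters without sacrificing representational power."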