RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting green fruits in orchards remains challenging due to ambiguous labeling, severe occlusion, and strong background similarity. Method: This work constructs a dataset with dual annotation schemes (single-class and multi-class) and conducts the first systematic comparison of RF-DETR (featuring a DINOv2 backbone and deformable attention) and YOLOv12 on fine-grained occlusion classification (occluded vs. non-occluded). Contribution/Results: RF-DETR generalizes better under low-quality annotations, converging within only 10 epochs, and attains mAP@50 scores of 0.9464 (single-class) and 0.8298 (multi-class), significantly outperforming YOLOv12. In contrast, YOLOv12-N/L excels on mAP@50:95, highlighting the complementary trade-off between Transformer-based global modeling and CNN-based efficiency. This study establishes a new paradigm and benchmark for agricultural vision-based fruit detection.

📝 Abstract
This study conducts a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruit) annotations to assess model performance under dynamic real-world conditions. The RF-DETR model, built on a DINOv2 backbone with deformable attention, excelled at global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean average precision (mAP@50) of 0.9464 in single-class detection, demonstrating its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12-N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12-L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Analysis of training dynamics highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited to fast-response scenarios.

Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
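The abstract's split between mAP@50 and mAP@50:95 comes down to how strict the IoU matching criterion is. A minimal sketch (the boxes are hypothetical, not from the paper's dataset) of why a loosely localized prediction can count as a true positive at IoU 0.50 yet fail most of the 0.50:0.95 thresholds:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A loosely localized greenfruit prediction: it overlaps the ground truth
# well enough for the 0.50 threshold but not for the stricter ones.
gt = (10, 10, 50, 50)
pred = (14, 14, 54, 54)

thresholds = [0.50 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
hits = [iou(gt, pred) >= t for t in thresholds]
print(f"IoU = {iou(gt, pred):.3f}")          # ~0.681
print(f"TP at IoU 0.50: {hits[0]}")           # True
print(f"TPs over 0.50:0.95: {sum(hits)}/10")  # 4/10
```

This is why a model can lead on mAP@50 (good at finding fruits in clutter) while another leads on mAP@50:95 (tighter box localization), exactly the RF-DETR vs. YOLOv12 trade-off the abstract reports.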
Problem

Research questions and friction points this paper is trying to address.

Compare RF-DETR and YOLOv12 for greenfruit detection in orchards
Evaluate model performance under occlusion and label ambiguity
Assess single-class and multi-class detection in complex environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

RF-DETR uses DINOv2 backbone for global context
YOLOv12 employs CNN attention for local features
Custom dataset evaluates single and multi-class detection
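The dual annotation scheme above implies a simple mapping from the occlusion-aware multi-class labels down to the single greenfruit class. A hedged sketch of that collapse (the class-name strings are hypothetical, not the paper's exact label names):

```python
# Hypothetical label strings; the paper's exact class names are not given here.
MULTI_TO_SINGLE = {
    "occluded_greenfruit": "greenfruit",
    "non_occluded_greenfruit": "greenfruit",
}

def to_single_class(annotations):
    """Collapse occlusion-aware labels into one 'greenfruit' class,
    leaving the box geometry untouched."""
    return [
        {**ann, "label": MULTI_TO_SINGLE.get(ann["label"], ann["label"])}
        for ann in annotations
    ]

multi = [
    {"label": "occluded_greenfruit", "box": (12, 30, 44, 70)},
    {"label": "non_occluded_greenfruit", "box": (80, 25, 120, 66)},
]
single = to_single_class(multi)
print({a["label"] for a in single})  # both annotations collapse to 'greenfruit'
```

Deriving the single-class set from the multi-class one keeps the two evaluation settings comparable: the boxes are identical, so any score gap between settings isolates the cost of the occluded/non-occluded distinction.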