🤖 AI Summary
This study addresses the combined challenge of detection, ripeness classification, and grasp-point localization for autonomous harvesting of greenhouse tomatoes. It proposes YOLO26-RipeLoc Lite, a lightweight YOLO26 variant that performs all three tasks simultaneously. The model introduces three components: a Lightweight Feature Pyramid Network (LFPN) built on depthwise separable convolutions, a Ripeness-Aware Attention Module (RAAM) that combines dual-pooling attention with a learnable ripeness bias vector, and a Compact Detection Head (CDH) with an integrated center-point regression branch. Training uses greenhouse-aware HSV augmentation and a staged (3-phase) unfreezing strategy. Evaluated on a dataset of 1,500 images, the model achieves 92.9% mAP@0.5 with only 2.38 million parameters; post-training BatchNorm pruning at 30% reduces the parameter count to about 1.8 million with negligible accuracy loss. Against YOLOv8, YOLO11, YOLO12, and YOLO26 baselines, it reports the highest precision, 2.9 pp above YOLO12n with 7.0% fewer parameters.
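To make the parameter savings from depthwise separable convolutions concrete, the following minimal sketch counts the weights of a standard convolution and of its depthwise separable replacement. The layer width (128 channels) and the 3×3 kernel are hypothetical values chosen for illustration; they are not taken from the LFPN design, which the summary does not specify.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer width, chosen only for illustration (not from the paper).
c_in, c_out = 128, 128
standard = conv_params(c_in, c_out)
separable = depthwise_separable_params(c_in, c_out)
ratio = separable / standard  # equals 1/c_out + 1/k**2

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"ratio:     {ratio:.3f}")
```

For this illustrative layer the separable form needs roughly 12% of the weights of the standard convolution, which is the kind of saving that makes sub-3M-parameter detectors feasible.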
📝 Abstract
In greenhouse tomato production, automated harvesting requires accurate detection of ripe tomatoes, ripeness classification, and precise picking-point localization for robotic end-effectors. This paper proposes YOLO26-RipeLoc Lite, a lightweight deep learning architecture based on YOLO26 for simultaneous detection, ripeness classification, and center-point localization of greenhouse tomatoes. The model introduces three modifications: (1) a Lightweight Feature Pyramid Network (LFPN) with depthwise separable convolutions for efficient multi-scale fusion, (2) a Ripeness-Aware Attention Module (RAAM) with dual pooling and a learnable ripeness bias vector for enhanced color-texture discrimination, and (3) a Compact Detection Head (CDH) with shared convolutions and an integrated center-point regression branch for direct grasp planning. The model is evaluated on a custom dataset of 1,500 images with 6,227 instances (3,566 ripe, 2,661 unripe) from the SILAL greenhouse, Abu Dhabi, UAE. YOLO26-RipeLoc Lite achieves an mAP@0.5 of 92.9% (95.2% ripe, 90.6% unripe) and the highest precision (95.2%) among all evaluated architectures, using only 2.38M parameters. Post-training BatchNorm pruning at 30% reduces parameters to ~1.8M with negligible accuracy loss. Ablation studies confirm that greenhouse-aware HSV augmentation provides the largest improvement (+2.02 pp mAP@0.5), backbone freezing achieves peak precision (93.8%), and 3-phase progressive unfreezing yields the best localization quality (mAP@0.5:0.95 of 64.6%). Comparisons with YOLOv8n/s, YOLO11n/s, YOLO12n/s, and YOLO26s confirm a superior accuracy-efficiency trade-off: 2.9 pp higher precision than YOLO12n with 7.0% fewer parameters, plus integrated center-point localization for robotic end-effector guidance.
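The abstract does not state which criterion the BatchNorm pruning step uses. One common reading, assumed here, is to rank all channels by the magnitude of their BatchNorm scale factor (gamma) and drop the lowest 30% globally. The sketch below illustrates only that selection step in plain Python; the function name `bn_prune_masks` and the toy gamma values are hypothetical, and a real pipeline would also rebuild the convolution weights around the kept channels.

```python
from typing import List


def bn_prune_masks(gammas_per_layer: List[List[float]],
                   prune_percent: int = 30) -> List[List[bool]]:
    """Return one keep-mask per layer, dropping the globally smallest |gamma|.

    Assumption: channel importance is measured by the BatchNorm scale factor.
    Channels tied with the threshold are pruned too, so slightly more than
    prune_percent of the channels can be removed when ties occur.
    """
    all_scales = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    n_prune = len(all_scales) * prune_percent // 100
    if n_prune == 0:
        return [[True] * len(layer) for layer in gammas_per_layer]

    threshold = all_scales[n_prune - 1]  # largest scale that still gets pruned
    masks = []
    for layer in gammas_per_layer:
        mask = [abs(g) > threshold for g in layer]
        if not any(mask):
            # Keep the strongest channel so the layer is never removed entirely.
            best = max(range(len(layer)), key=lambda i: abs(layer[i]))
            mask[best] = True
        masks.append(mask)
    return masks


# Toy scale factors for two BatchNorm layers (illustrative values only).
gammas = [
    [0.9, 0.05, 0.4, 0.01],
    [0.7, 0.02, 0.3, 0.6, 0.03, 0.8],
]
masks = bn_prune_masks(gammas, prune_percent=30)
kept = sum(sum(m) for m in masks)
print(masks)
print(f"kept {kept} of {sum(len(g) for g in gammas)} channels")
```

Under this reading, removing 30% of channels need not remove 30% of parameters, because a convolution's weight count depends on both its input and output widths. That is consistent with the reported drop from 2.38M to ~1.8M parameters, a reduction of roughly 24%.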