🤖 AI Summary
This work addresses the challenges of automating sweet pepper harvesting in unstructured outdoor environments, where occlusion and complex backgrounds hinder reliable perception and manipulation. To this end, the authors propose VADER, a dual-arm mobile harvesting robot that integrates hierarchical visual perception (from scene-level fruit detection to fruit-level pose estimation), coordinated dual-arm motion planning, and a teleoperation fallback mechanism based on the GELLO framework. The system achieves the first demonstration of autonomous, coordinated dual-arm harvesting in real-world agricultural fields. A cross-domain sweet pepper dataset comprising over 3,200 images was curated to enable end-to-end training. Experimental results show a harvesting success rate exceeding 60% under outdoor conditions, with a per-fruit cycle time under 100 seconds. The authors state that the dataset and code will be publicly released to advance research in agricultural robotics.
📝 Abstract
Agricultural robotics has emerged as a critical solution to the labor shortages and rising costs associated with manual crop harvesting. Bell pepper harvesting, in particular, is a labor-intensive task, accounting for up to 50% of total production costs. While automated solutions have shown promise in controlled greenhouse environments, harvesting in unstructured outdoor farms remains an open challenge due to environmental variability and occlusion. This paper presents VADER (Vision-guided Autonomous Dual-arm Extraction Robot), a dual-arm mobile manipulation system designed specifically for the autonomous harvesting of bell peppers in outdoor environments. The system integrates a robust perception pipeline with a dual-arm planning framework that coordinates a gripping arm and a cutting arm for extraction. We validate the system through trials under various realistic conditions, demonstrating a harvest success rate exceeding 60% with a cycle time of under 100 seconds per fruit, while a fail-safe based on the GELLO teleoperation framework ensures robustness. To support robust perception, we contribute a hierarchically structured dataset of over 3,200 images spanning indoor and outdoor domains, pairing wide-field scene images with close-up pepper images to enable a coarse-to-fine training strategy from fruit detection to high-precision pose estimation. The code and dataset will be made publicly available upon acceptance.