🤖 AI Summary
Existing autonomous pruning systems have seen limited adoption in forestry because they rely on expensive sensors and struggle to localize fine branches as small as 10 mm in diameter. This work proposes a two-stage approach using a single low-cost stereo camera, integrating instance segmentation (via YOLOv8/YOLOv9) with depth estimation to localize radiata pine branches in 3D. The method combines stereo-based segmentation, centroid triangulation, and MAD-based outlier rejection to address the sparse textures, thin structures, and disparity noise typical of forest environments. Qualitative evaluations at 1–2 m indicate that learning-based stereo matching models such as PSMNet and ACVNet produce more coherent depth maps, allowing the system to estimate branch-to-camera distances robustly enough to support autonomous drone-based pruning.
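To make the mask-to-distance step concrete, the following is a minimal sketch of how a segmentation mask and a disparity map could be reduced to one branch position. It is not the authors' implementation: the function names, the rejection threshold `k`, the use of the inlier mean, and the camera parameters in the usage note are all assumptions chosen for illustration. It relies only on the standard pinhole stereo relation Z = f·B/d and on the usual MAD scaling constant 1.4826.

```python
import numpy as np


def robust_branch_disparity(mask, disparity, k=3.0):
    """Return one disparity for a branch mask after MAD outlier rejection.

    mask and disparity are HxW arrays. k is a hypothetical threshold in
    scaled-MAD units, not a value taken from the paper.
    """
    d = disparity[mask.astype(bool)]
    d = d[np.isfinite(d) & (d > 0)]          # drop invalid disparities
    if d.size == 0:
        return None
    med = np.median(d)
    mad = np.median(np.abs(d - med))
    if mad > 0:
        # 1.4826 * MAD approximates the standard deviation for Gaussian noise.
        d = d[np.abs(d - med) <= k * 1.4826 * mad]
    return float(d.mean())


def mask_centroid(mask):
    """Return the (u, v) pixel centroid of a binary mask."""
    vs, us = np.nonzero(mask)
    return float(us.mean()), float(vs.mean())


def triangulate_centroid(mask, disparity, focal_px, baseline_m, cx, cy, k=3.0):
    """Back-project the mask centroid to camera coordinates (X, Y, Z) in metres.

    Assumes a rectified pinhole stereo pair with the same focal length
    (in pixels) on both axes.
    """
    d = robust_branch_disparity(mask, disparity, k)
    if d is None:
        return None
    z = focal_px * baseline_m / d            # depth from disparity
    u, v = mask_centroid(mask)
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return x, y, z
```

A call such as `triangulate_centroid(mask, disp, focal_px=700.0, baseline_m=0.063, cx=640.0, cy=360.0)` would return the branch position relative to the left camera; the intrinsics shown are placeholder values and should be replaced with the calibration of the camera actually used. Taking the mean of the inliers, rather than the median, is one design choice among several, and it makes the rejection step matter because a few background pixels leaking into the mask would otherwise bias the result.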
📝 Abstract
Manual pruning of radiata pine, a species of major economic importance to New Zealand forestry, is hazardous, labour-intensive, and increasingly constrained by workforce shortages. Existing autonomous pruning platforms typically rely on expensive sensors such as LiDAR and are limited to thick branches, which restricts their wider adoption. This paper investigates whether a single low-cost stereo camera mounted on a drone can provide sufficiently accurate branch detection and three-dimensional positioning to support autonomous pruning of branches as thin as 10 mm, thereby removing the need for auxiliary depth sensors. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, Mask R-CNN variants and the YOLOv8 and YOLOv9 families are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera; YOLOv8 and YOLOv9 are selected as representative state-of-the-art real-time segmentors at the time of data collection, and the framework is designed to remain compatible with newer YOLO releases. For depth estimation, a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated, including cross-dataset fine-tuning experiments that expose the domain gap between urban driving benchmarks and natural forestry scenes. The main novelty of this work lies in coupling stereo segmentation with a centroid-based triangulation algorithm and Median-Absolute-Deviation outlier rejection that converts a segmentation mask and disparity map into a single robust branch-to-camera distance, addressing the challenges of sparse texture, thin structures, and noisy disparity values typical of forest scenes. Qualitative evaluations at distances of 1-2 m show that the learning-based stereo methods produce more coherent depth es...