🤖 AI Summary
This study addresses the challenge of high-precision, real-time stereo depth estimation for autonomous drone-based tree pruning, where small disparity errors in vegetated scenes can lead to significant depth inaccuracies. The work presents the first stereo matching benchmark tailored to real-world canopy environments, using disparity maps generated by DEFOM-Stereo as supervision. It systematically evaluates ten representative networks, including BANet-3D, RAFT-Stereo, and AnyNet, on perceptual quality, structural fidelity, and embedded real-time performance. Experimental results show that BANet-3D achieves the best overall quality (SSIM = 0.883), RAFT-Stereo scores highest on scene-level understanding (ViTScore = 0.799), and AnyNet is the only near-real-time solution, running at 6.99 FPS at 1080P on an NVIDIA Jetson Orin Super.
📝 Abstract
Autonomous drone-based tree pruning needs accurate, real-time depth estimation from stereo cameras. Depth is computed from disparity maps using $Z = fB/d$, where $f$ is the focal length in pixels, $B$ is the stereo baseline, and $d$ is the disparity in pixels, so even small disparity errors cause noticeable depth errors at working distances. Building on our earlier work that identified DEFOM-Stereo as the best reference disparity generator for vegetation scenes, we present the first study to train and test ten deep stereo matching networks on real tree branch images. We use the Canterbury Tree Branches dataset (5,313 stereo pairs captured with a ZED Mini camera at 1080P and 720P), with disparity maps generated by DEFOM-Stereo as training targets. The ten methods cover step-by-step (iterative) refinement, 3D convolution, edge-aware attention, and lightweight designs. Using perceptual metrics (SSIM, LPIPS, ViTScore) and structural metrics (SIFT/ORB feature matching), we find that BANet-3D produces the best overall quality (SSIM = 0.883, LPIPS = 0.157), while RAFT-Stereo scores highest on scene-level understanding (ViTScore = 0.799). Testing on an NVIDIA Jetson Orin Super (16 GB, independently powered) mounted on our drone shows that AnyNet reaches 6.99 FPS at 1080P, making it the only near-real-time option, while BANet-2D gives the best quality-speed balance at 1.21 FPS. We also compare 720P and 1080P processing times to guide resolution choices for forestry drone systems.
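
As a rough illustration of why disparity accuracy matters, the relation $Z = fB/d$ can be differentiated to give the first-order depth error $|\Delta Z| \approx \frac{Z^2}{fB}\,|\Delta d|$, which grows with the square of the distance. The sketch below evaluates this in Python. The focal length (1400 px) and baseline (0.063 m) are illustrative assumptions chosen to be plausible for a small stereo camera at 1080P; they are not calibration values from this study, and the function names are hypothetical.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth Z = f * B / d, with f and d in pixels and B in metres."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m: float, focal_px: float, baseline_m: float,
                disparity_err_px: float) -> float:
    """First-order depth error: |dZ| ~= Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * abs(disparity_err_px)


if __name__ == "__main__":
    # Assumed (hypothetical) camera parameters, not taken from the paper.
    FOCAL_PX = 1400.0
    BASELINE_M = 0.063

    for z in (1.0, 2.0, 3.0):
        # Disparity that corresponds to this depth under the assumed parameters.
        d = FOCAL_PX * BASELINE_M / z
        err = depth_error(z, FOCAL_PX, BASELINE_M, disparity_err_px=1.0)
        print(f"Z = {z:.1f} m  d = {d:.1f} px  error per 1 px = {err * 100:.1f} cm")
```

Under these assumed parameters, a one-pixel disparity error at 2 m corresponds to a depth error of roughly 4.5 cm, and the error quadruples when the distance doubles. Real values depend on the actual calibration of the camera used.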