Training Deep Stereo Matching Networks on Tree Branch Imagery: A Benchmark Study for Real-Time UAV Forestry Applications

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of high-precision, real-time stereo depth estimation for autonomous tree pruning by drones, where minor disparity errors in vegetated scenes can lead to significant depth inaccuracies. The work presents the first stereo matching benchmark tailored to real-world canopy environments, using disparity maps generated by DEFOM-Stereo as supervision. It systematically evaluates ten representative networks—including BANet-3D, RAFT-Stereo, and AnyNet—across perceptual quality, structural fidelity, and embedded real-time performance. Experimental results show that BANet-3D achieves the best overall quality (SSIM = 0.883), RAFT-Stereo excels in scene understanding (ViTScore = 0.799), and AnyNet is the only near-real-time solution, running at 6.99 FPS on a Jetson Orin at 1080p resolution.

📝 Abstract
Autonomous drone-based tree pruning requires accurate, real-time depth estimation from stereo cameras. Depth is computed from disparity via $Z = fB/d$, so even small disparity errors cause noticeable depth errors at working distances. Building on our earlier work that identified DEFOM-Stereo as the best reference disparity generator for vegetation scenes, we present the first study to train and test ten deep stereo matching networks on real tree branch images. We use the Canterbury Tree Branches dataset -- 5,313 stereo pairs from a ZED Mini camera at 1080P and 720P -- with DEFOM-generated disparity maps as training targets. The ten methods cover iterative refinement, 3D convolution, edge-aware attention, and lightweight designs. Using perceptual metrics (SSIM, LPIPS, ViTScore) and structural metrics (SIFT/ORB feature matching), we find that BANet-3D produces the best overall quality (SSIM = 0.883, LPIPS = 0.157), while RAFT-Stereo scores highest on scene-level understanding (ViTScore = 0.799). Testing on an NVIDIA Jetson Orin Super (16 GB, independently powered) mounted on our drone shows that AnyNet reaches 6.99 FPS at 1080P -- the only near-real-time option -- while BANet-2D gives the best quality-speed balance at 1.21 FPS. We also compare 720P and 1080P processing times to guide resolution choices for forestry drone systems.
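The sensitivity the abstract describes follows directly from $Z = fB/d$: depth varies inversely with disparity, so a fixed pixel-level disparity error grows into a larger depth error as range increases. A minimal sketch below illustrates this; the focal length and baseline values are assumptions chosen to be ZED Mini-like for illustration, not figures from the paper.

```python
# Depth from disparity: Z = f * B / d
# f_px: focal length in pixels, B: stereo baseline in metres, d: disparity in pixels.
# Values below are illustrative assumptions, not the paper's calibration.
f_px = 700.0   # assumed focal length (pixels)
B = 0.063      # assumed baseline, ~63 mm (ZED Mini-like)

def depth(d_px: float) -> float:
    """Triangulated depth in metres for a given disparity in pixels."""
    return f_px * B / d_px

d = 30.0                         # disparity at a branch ~1.5 m away
Z = depth(d)                     # ~1.47 m
err = depth(d - 1.0) - Z         # depth error from a 1-pixel disparity error
print(f"Z = {Z:.3f} m, 1-px disparity error -> {err * 100:.1f} cm depth error")
```

At this working distance a single-pixel disparity error already shifts the estimated depth by several centimetres, which is why sub-pixel disparity accuracy matters for pruning end-effectors.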
Problem

Research questions and friction points this paper is trying to address.

stereo matching
depth estimation
UAV forestry
real-time processing
tree branch imagery
Innovation

Methods, ideas, or system contributions that make the work stand out.

deep stereo matching
tree branch imagery
real-time UAV
disparity estimation
embedded benchmarking
Yida Lin
Centre for Data Science and Artificial Intelligence, Victoria University of Wellington, Wellington, New Zealand
Bing Xue
Meta Superintelligence Labs
Mengjie Zhang
Centre for Data Science and Artificial Intelligence, Victoria University of Wellington, Wellington, New Zealand
Sam Schofield
Department of Computer Science and Software Engineering, University of Canterbury, Canterbury, New Zealand
Richard Green
Department of Computer Science and Software Engineering, University of Canterbury, Canterbury, New Zealand