🤖 AI Summary
Wildlife monitoring with monocular camera traps lacks metric depth information, which hinders accurate estimation of distances to individual animals. To address this, we introduce WildDepth, the first monocular metric depth estimation (MDE) benchmark designed specifically for field camera-trap scenarios, in which ground-truth depth is obtained via ChArUco-based geometric calibration and measurement. We systematically evaluate leading MDE models: Depth Anything V2, ML Depth Pro, ZoeDepth, and Metric3D. Results show that median depth aggregation consistently outperforms mean aggregation. Depth Anything V2 achieves the best accuracy–speed trade-off (MAE = 0.454 m, correlation coefficient = 0.962, latency = 0.22 s/image), whereas ZoeDepth, though fastest (0.17 s/image), exhibits the highest error (MAE = 3.087 m). WildDepth fills a critical gap in systematic MDE evaluation under natural ecological conditions and provides empirical guidance for model selection in biodiversity monitoring.
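The ChArUco-based ground truth rests on standard projective geometry: once the camera's focal length is calibrated, the distance to a board of known physical size follows from its apparent size in pixels. A minimal pinhole-model sketch (the function name, focal length, and board width below are illustrative assumptions, not values from the paper):

```python
def pinhole_distance(focal_px: float, real_width_m: float, pixel_width: float) -> float:
    """Metric distance Z = f * W / w for an object of known real width W.

    focal_px     -- calibrated focal length in pixels (assumed value)
    real_width_m -- physical width of the calibration board in metres
    pixel_width  -- apparent width of the board in the image, in pixels
    """
    return focal_px * real_width_m / pixel_width

# Example: a 0.30 m wide board imaged at 150 px with f = 1000 px
# lies at 1000 * 0.30 / 150 = 2.0 m from the camera.
print(pinhole_distance(1000.0, 0.30, 150.0))  # → 2.0
```

In practice the board corners would be detected and the camera calibrated with OpenCV's ArUco/ChArUco tooling; the formula above only shows the geometric principle behind the measured distances.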
📝 Abstract
Camera traps are widely used for wildlife monitoring, but extracting accurate distance measurements from monocular images remains challenging due to the lack of depth information. While monocular depth estimation (MDE) methods have advanced significantly, their performance in natural wildlife environments has not been systematically evaluated. This work introduces the first benchmark for monocular metric depth estimation in wildlife monitoring conditions. We evaluate four state-of-the-art MDE methods (Depth Anything V2, ML Depth Pro, ZoeDepth, and Metric3D) alongside a geometric baseline on 93 camera trap images with ground truth distances obtained using calibrated ChArUco patterns. Our results demonstrate that Depth Anything V2 achieves the best overall performance with a mean absolute error of 0.454 m and correlation of 0.962, while methods like ZoeDepth show significant degradation in outdoor natural environments (MAE: 3.087 m). We find that median-based depth extraction consistently outperforms mean-based approaches across all deep learning methods. Additionally, we analyze computational efficiency: ZoeDepth is fastest (0.17 s per image) but least accurate, while Depth Anything V2 provides an optimal balance of accuracy and speed (0.22 s per image). This benchmark establishes performance baselines for wildlife applications and provides practical guidance for implementing depth estimation in conservation monitoring systems.
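The median-vs-mean finding is intuitive: a segmentation mask around an animal almost inevitably includes some background pixels, whose depths are outliers relative to the animal's surface, and the median is robust to them while the mean is not. A minimal sketch of the comparison (the function name and pixel values are illustrative assumptions, not the paper's implementation):

```python
from statistics import mean, median

def animal_distance(depth_pixels, agg="median"):
    """Collapse per-pixel depths inside a detection mask into one distance.

    depth_pixels -- iterable of metric depth values (metres) under the mask
    agg          -- "median" (robust to mask bleed) or "mean"
    """
    values = list(depth_pixels)
    return median(values) if agg == "median" else mean(values)

# Four pixels on the animal at ~3 m, one background pixel leaking in at 30 m.
pixels = [3.0, 3.1, 2.9, 3.0, 30.0]
print(animal_distance(pixels, "median"))  # → 3.0  (unaffected by the outlier)
print(animal_distance(pixels, "mean"))    # → 8.4  (pulled far off by one pixel)
```

This kind of single-outlier example is consistent with why median aggregation outperformed mean aggregation across all deep learning methods in the benchmark.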