🤖 AI Summary
Monocular depth estimation (MDE) is inherently ill-posed, making reliable 3D structure recovery from a single 2D image challenging. To address this, we propose WEDepth—a zero-shot adaptation method that activates geometric and scene priors embedded in pre-trained vision foundation models (VFMs) without fine-tuning, architectural modification, or weight updates. WEDepth employs a multi-level feature injection mechanism that explicitly integrates shallow texture and deep semantic features, enabling context-aware depth reconstruction. Evaluated on NYU-Depth v2 and KITTI, it achieves state-of-the-art performance—comparable to diffusion-based or multi-step inference methods—while demonstrating strong cross-domain zero-shot generalization. Our key contribution is the first plug-and-play activation of multi-level implicit priors within VFMs, achieving an unprecedented balance among computational efficiency, architectural generality, and interpretability.
📝 Abstract
Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world-understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures or pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on the NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate that our method exhibits strong zero-shot transfer capability across diverse scenarios.
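The "multi-level feature enhancer" idea (a frozen backbone whose features are injected into a depth decoder at several representation levels, from shallow texture to deep semantics) can be illustrated with a toy sketch. Everything below is an assumption for illustration: the pooling-based stand-in for a VFM, the per-level weights, and the additive fusion are placeholders, not the paper's actual architecture.

```python
import numpy as np

def frozen_vfm_features(image, levels=3):
    """Stand-in for a frozen vision foundation model: returns feature
    maps at several representation levels (shallow -> deep). Here each
    'level' is simply a 2x2 mean-pooled version of the previous one."""
    feats = []
    x = image
    for _ in range(levels):
        x = x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))
        feats.append(x)
    return feats  # shallow first, deep last

def inject(decoder_state, feat, weight):
    """Toy 'feature injection': scale the frozen feature map and add it
    to the decoder state of matching spatial size (no weight updates)."""
    return decoder_state + weight * feat

def predict_depth(image):
    feats = frozen_vfm_features(image)
    # The decoder starts from the deepest feature map and upsamples,
    # injecting frozen features level by level (deep semantics first,
    # shallow texture last). Weights are arbitrary illustrative values.
    state = np.zeros_like(feats[-1])
    for feat, w in zip(reversed(feats), [1.0, 0.5, 0.25]):
        state = inject(state, feat, w)
        if feat is not feats[0]:
            # Nearest-neighbor upsample to match the next (shallower) level.
            state = np.repeat(np.repeat(state, 2, axis=0), 2, axis=1)
    return state

rng = np.random.default_rng(0)
depth = predict_depth(rng.random((32, 32)))
print(depth.shape)  # (16, 16)
```

The key property being mimicked is that the backbone is never modified: only read-out features at multiple depths are reused, which is what makes the approach plug-and-play in spirit.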