WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular depth estimation (MDE) is inherently ill-posed, making reliable 3D structure recovery from a single 2D image challenging. To address this, we propose WEDepth, an adaptation method that activates the geometric and scene priors embedded in pre-trained vision foundation models (VFMs) without modifying their architectures or pretrained weights. WEDepth employs a multi-level feature injection mechanism that integrates shallow texture and deep semantic features, enabling context-aware depth reconstruction. Evaluated on NYU-Depth v2 and KITTI, it achieves state-of-the-art performance, comparable to diffusion-based or multi-step inference methods, while demonstrating strong cross-domain zero-shot generalization. Our key contribution is a plug-and-play activation of multi-level implicit priors within VFMs that balances computational efficiency, architectural generality, and interpretability.

📝 Abstract
Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures and pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on the NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate that our method exhibits strong zero-shot transfer capability across diverse scenarios.
Problem

Research questions and friction points this paper is trying to address.

Adapting vision foundation models for monocular depth estimation
Enhancing depth prediction without modifying pretrained model structures
Leveraging world knowledge priors across different representation levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts Vision Foundation Models without structural modifications
Uses VFMs as multi-level feature enhancers
Systematically injects prior knowledge at different representation levels
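The core idea above — keeping a VFM frozen and injecting its multi-level features into a depth decoder — can be sketched in a toy form. This is a minimal NumPy illustration, not the authors' implementation: the function names (`frozen_vfm_features`, `inject`), shapes, and the residual-addition injection are all assumptions chosen for clarity; the paper's actual mechanism may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_vfm_features(tokens, num_levels=3):
    """Stand-in for a frozen VFM backbone: returns feature maps at several
    depths (shallow ~ texture, deep ~ semantics). The transforms are fixed
    random projections that are never updated, mirroring the 'no weight
    updates to the VFM' constraint described in the summary."""
    feats, x = [], tokens
    for _ in range(num_levels):
        w = rng.standard_normal((x.shape[-1], x.shape[-1]))  # frozen weights
        x = np.tanh(x @ w)
        feats.append(x)
    return feats

def inject(decoder_feat, vfm_feat, proj):
    """One plausible injection: project the VFM prior into the decoder's
    channel space and add it residually."""
    return decoder_feat + vfm_feat @ proj

# Toy shapes: 16 image tokens with 8-dim features.
image_tokens = rng.standard_normal((16, 8))
vfm_feats = frozen_vfm_features(image_tokens)

decoder_feat = rng.standard_normal((16, 8))
for f in vfm_feats:  # inject shallow and deep priors alike
    proj = rng.standard_normal((8, 8)) * 0.1  # would be a trainable adapter
    decoder_feat = inject(decoder_feat, f, proj)

depth = decoder_feat.mean(axis=-1)  # toy per-token depth head
print(depth.shape)  # (16,)
```

The design point the sketch captures is that only the small projection adapters would carry task-specific learning, while the multi-level backbone features — and hence the world knowledge priors — come from the untouched VFM.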