Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular depth estimation suffers from scale ambiguity and domain shift, making metric depth recovery challenging. This work proposes a language-guided uncertainty envelope mechanism that leverages textual descriptions to provide coarse scale priors and adaptively selects image-specific affine calibration parameters within an uncertainty-aware range, thereby avoiding reliance on noisy text-only point estimates. The approach freezes both the relative-depth backbone and the CLIP text encoder, integrating multi-scale visual feature pooling with an affine transformation in inverse-depth space to enable efficient, lightweight calibration. It improves in-domain accuracy on NYUv2 and KITTI and demonstrates significantly better zero-shot transfer on SUN-RGBD and DDAD compared to purely language-based baselines, exhibiting enhanced robustness.
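The envelope mechanism described above can be sketched as follows: the language branch bounds the feasible range of a calibration parameter, and an image-specific score picks a value strictly inside those bounds. This is a minimal illustration with hypothetical names (`lang_center`, `lang_halfwidth`, `vision_logit`); the paper's actual head architectures are not specified here.

```python
import math

def select_calibration(lang_center, lang_halfwidth, vision_logit):
    """Choose an affine-calibration parameter inside a language-predicted
    uncertainty envelope.

    lang_center, lang_halfwidth: text-derived coarse prior (envelope bounds).
    vision_logit: unconstrained image-specific score from pooled visual features.
    (Hypothetical interface; a sketch of the idea, not the paper's exact heads.)
    """
    lo = lang_center - lang_halfwidth
    hi = lang_center + lang_halfwidth
    # A sigmoid squashes the visual score into [0, 1], so the selected value
    # can never leave the language-derived envelope -- this is what avoids
    # committing to a noisy text-only point estimate.
    t = 1.0 / (1.0 + math.exp(-vision_logit))
    return lo + t * (hi - lo)
```

With a neutral visual score (`vision_logit = 0`) the selection falls back to the language prior's center; extreme scores saturate toward the envelope edges but never cross them.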

📝 Abstract
Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments show improved in-domain accuracy on NYUv2 and KITTI, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
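The closed-form least-squares oracle mentioned in the abstract admits a compact sketch: per image, fit scale and shift (a, b) so that the affine-transformed relative prediction matches ground-truth inverse depth, then invert back to metric depth. This is an assumption-laden illustration (function names, masking, and the exact parameterization are mine, not the paper's):

```python
import numpy as np

def oracle_affine_inverse_depth(rel_inv_depth, gt_depth, eps=1e-6):
    """Closed-form per-image least-squares fit of (a, b) in inverse-depth
    space: minimize || a * r + b - 1/d ||^2.

    rel_inv_depth: frozen backbone's relative (inverse-depth-like) prediction.
    gt_depth: metric ground-truth depth.
    (Sketch only; validity masking and weighting are assumed away.)
    """
    r = np.asarray(rel_inv_depth, dtype=np.float64).ravel()
    g = 1.0 / np.maximum(np.asarray(gt_depth, dtype=np.float64).ravel(), eps)
    # Normal-equations solve via lstsq on the design matrix [r, 1].
    A = np.stack([r, np.ones_like(r)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, g, rcond=None)
    return a, b

def to_metric(rel_inv_depth, a, b, eps=1e-6):
    """Map relative inverse depth to metric depth via the fitted affine transform."""
    inv = a * np.asarray(rel_inv_depth, dtype=np.float64) + b
    return 1.0 / np.maximum(inv, eps)
```

During training, the (a, b) recovered by this oracle would serve as per-image supervision targets for the envelope and the selected calibration, which is what makes the lightweight heads trainable without touching the backbone.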
Problem

Research questions and friction points this paper is trying to address.

monocular depth estimation
metric scale recovery
domain shift
global scale ambiguity
relative depth
Innovation

Methods, ideas, or system contributions that make the work stand out.

monocular depth estimation
language-vision calibration
metric scale recovery
uncertainty-aware envelope
frozen backbone adaptation
Mingxing Zhan
Hefei University of Technology, Hefei, Anhui, China
Li Zhang
Bytedance Inc., Qualcomm, Peking University, Institute of Computing Technology
image/video coding, processing
Beibei Wang
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, Anhui, China
Yingjie Wang
USTC
Zenglin Shi
Professor of Artificial Intelligence, Hefei University of Technology
Deep Learning, Computer Vision, Machine Learning, Multimedia