🤖 AI Summary
Monocular depth estimation suffers from scale ambiguity and domain shift, making metric depth recovery challenging. This work proposes a language-guided uncertainty envelope mechanism that leverages textual descriptions to provide coarse scale priors and adaptively selects image-specific affine calibration parameters within an uncertainty-aware range, thereby avoiding reliance on noisy text-only point estimates. The approach freezes both the relative-depth backbone and the CLIP text encoder, combining multi-scale visual feature pooling with an affine transformation in inverse-depth space to enable efficient, lightweight calibration. It achieves improved in-domain accuracy on NYUv2 and KITTI and demonstrates significantly better zero-shot transfer to SUN-RGBD and DDAD than purely language-based baselines, exhibiting enhanced robustness.
📝 Abstract
Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and omitted objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments show improved in-domain accuracy on NYUv2 and KITTI, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
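The closed-form oracle mentioned above is standard least squares: given a relative-depth prediction and ground truth, the optimal affine parameters (scale, shift) in inverse-depth space minimize the squared error between the calibrated inverse depth and the ground-truth inverse depth. A minimal NumPy sketch, assuming the function names and the validity masking are illustrative rather than the paper's actual implementation:

```python
import numpy as np

def inverse_depth_affine_oracle(pred_rel_depth, gt_depth, eps=1e-6):
    """Closed-form least-squares oracle: find (a, b) minimizing
    || a * d_rel + b - d_gt ||^2 over valid pixels, where d = 1/depth
    (inverse-depth space). Hypothetical helper, not the paper's code."""
    valid = gt_depth > eps  # ignore pixels without ground truth
    d_rel = 1.0 / np.clip(pred_rel_depth[valid], eps, None)
    d_gt = 1.0 / gt_depth[valid]
    # Design matrix [d_rel, 1] for the affine fit a * d_rel + b
    A = np.stack([d_rel, np.ones_like(d_rel)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d_gt, rcond=None)
    return a, b

def apply_calibration(pred_rel_depth, a, b, eps=1e-6):
    """Map relative depth to metric depth via the affine transform
    in inverse depth, then invert back to depth."""
    d_rel = 1.0 / np.clip(pred_rel_depth, eps, None)
    metric_inv = a * d_rel + b
    return 1.0 / np.clip(metric_inv, eps, None)
```

In the paper's setting, the (a, b) produced by this oracle would serve as per-image supervision for the language-predicted envelope and the visually selected calibration inside it.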