🤖 AI Summary
Monocular depth estimation suffers from scale ambiguity and domain shift, making metric depth recovery challenging. This work proposes a language-guided uncertainty envelope mechanism that leverages textual descriptions to provide coarse scale priors and adaptively selects image-specific affine calibration parameters within an uncertainty-aware range, thereby avoiding reliance on noisy text-only point estimates. The approach freezes both the relative-depth backbone and the CLIP text encoder, combining multi-scale visual feature pooling with an affine transformation in inverse-depth space to enable efficient, lightweight calibration. It achieves improved in-domain accuracy on NYUv2 and KITTI and demonstrates significantly better zero-shot transfer to SUN-RGBD and DDAD than purely language-based baselines, exhibiting enhanced robustness.
📝 Abstract
Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and omitted objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments show improved in-domain accuracy on NYUv2 and KITTI, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
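The closed-form oracle mentioned above is standard least squares: given a relative-depth prediction and ground truth, the optimal affine parameters (scale, shift) in inverse-depth space minimize the squared error between the calibrated inverse depth and the ground-truth inverse depth. A minimal NumPy sketch, assuming the function names and the validity masking are illustrative rather than the paper's actual implementation:

```python
import numpy as np

def inverse_depth_affine_oracle(pred_rel_depth, gt_depth, eps=1e-6):
    """Closed-form least-squares oracle: find (a, b) minimizing
    || a * d_rel + b - d_gt ||^2 over valid pixels, where d = 1/depth
    (inverse-depth space). Hypothetical helper, not the paper's code."""
    valid = gt_depth > eps  # ignore pixels without ground truth
    d_rel = 1.0 / np.clip(pred_rel_depth[valid], eps, None)
    d_gt = 1.0 / gt_depth[valid]
    # Design matrix [d_rel, 1] for the affine fit a * d_rel + b
    A = np.stack([d_rel, np.ones_like(d_rel)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d_gt, rcond=None)
    return a, b

def apply_calibration(pred_rel_depth, a, b, eps=1e-6):
    """Map relative depth to metric depth via the affine transform
    in inverse depth, then invert back to depth."""
    d_rel = 1.0 / np.clip(pred_rel_depth, eps, None)
    metric_inv = a * d_rel + b
    return 1.0 / np.clip(metric_inv, eps, None)
```

In the paper's setting, the (a, b) produced by this oracle would serve as per-image supervision for the language-predicted envelope and the visually selected calibration inside it.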