🤖 AI Summary
Monocular metric depth estimation suffers from inherent scale ambiguity, and it degrades further on surfaces that offer weak geometric cues, such as translucent or specular surfaces. To address this, we propose a training-free framework that treats depth estimation as an inverse problem. It uses a pre-trained latent diffusion model (LDM), conditioned on the input RGB image, as a generative prior, and adds differentiable stereo-geometric constraints (e.g., epipolar consistency and disparity-depth mapping) as regularization terms to recover the scale and shift needed for absolute depth maps. Because it requires no supervised fine-tuning or retraining, the method plugs into existing diffusion-based monocular depth estimators. It matches or surpasses state-of-the-art methods across indoor, outdoor, and complex scenes, particularly on translucent and specular surfaces. The work combines generative modeling with geometric reasoning to enable geometry-aware depth reconstruction without task-specific training.
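The disparity-depth mapping mentioned above is, for a rectified stereo pair, the standard relation depth = focal length × baseline / disparity. The following is a minimal sketch of that relation only, not the paper's implementation; the function name `disparity_to_depth` and its parameters are illustrative assumptions.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth (meters).

    Assumes a rectified stereo pair with focal length `focal_px` in pixels
    and camera baseline `baseline_m` in meters (hypothetical parameter names).
    Pixels with non-positive disparity have no valid depth and become NaN.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

Because the relation is differentiable wherever disparity is positive, it can in principle be used inside a gradient-based objective, which is the role the summary assigns to it.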
📝 Abstract
We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
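To illustrate what recovering scale and shift means, the sketch below aligns a relative depth map to sparse metric depth values (for example, values derived from stereo) with a closed-form least-squares fit. This is a simplified stand-in, not the paper's method, which applies stereo constraints within the diffusion model's inverse-problem formulation. It assumes the relative prediction is affinely related to metric depth, and the name `fit_scale_shift` is hypothetical.

```python
import numpy as np


def fit_scale_shift(relative_depth, metric_depth, mask=None):
    """Least-squares fit of (scale, shift) so that
    scale * relative_depth + shift approximates metric_depth.

    `mask` optionally selects pixels with trustworthy metric depth.
    Non-finite values in either input are ignored.
    """
    rel = np.asarray(relative_depth, dtype=np.float64).ravel()
    met = np.asarray(metric_depth, dtype=np.float64).ravel()
    valid = np.isfinite(rel) & np.isfinite(met)
    if mask is not None:
        valid &= np.asarray(mask, dtype=bool).ravel()
    if valid.sum() < 2:
        raise ValueError("need at least two valid pixels to fit scale and shift")
    # Design matrix [rel, 1] for the linear model met = scale * rel + shift.
    A = np.stack([rel[valid], np.ones(int(valid.sum()))], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, met[valid], rcond=None)
    return float(scale), float(shift)


# Usage: apply the fitted parameters to the full relative depth map.
rel = np.array([[0.1, 0.5], [0.9, 0.3]])
met = 4.0 * rel + 0.5  # synthetic metric depth for demonstration
scale, shift = fit_scale_shift(rel, met)
metric_pred = scale * rel + shift
```

A closed-form fit like this only needs two or more reliable metric measurements, so it works with sparse stereo matches; it does not, however, correct errors in the relative depth itself.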