🤖 AI Summary
This paper addresses unsupervised depth completion from a single sparse depth map and an RGB image in the absence of ground-truth depth. The authors propose a domain-agnostic framework that integrates large-scale pre-trained monocular depth estimation (MDE) models with sparse depth inputs. Specifically, the relative depth predicted by the MDE model is first segmented; pseudo-ground-truth depth pairs are then generated by randomly rescaling each segment with an affine transformation; finally, these pairs supervise an end-to-end refinement network that fuses the relative depth maps with the RGB input. This design avoids reliance on a single global affine calibration, mitigating depth scale drift and structural distortion. Evaluated on NYUv2 and KITTI benchmarks, the method consistently outperforms existing unsupervised approaches and direct MDE fine-tuning strategies in both accuracy and robustness.
📝 Abstract
The problem of depth completion involves predicting a dense depth image from a single sparse depth map and an RGB image. Unsupervised depth completion methods have been proposed for datasets where ground-truth depth is unavailable and supervised methods cannot be applied. However, these models require auxiliary data to estimate depth values, an assumption that rarely holds in real scenarios. Monocular depth estimation (MDE) models can produce a plausible relative depth map from a single image, but no prior work properly combines the sparse depth map with MDE for depth completion; applying a simple affine transformation to the relative depth map yields high error, since MDE models are inaccurate at estimating depth differences between objects. We introduce StarryGazer, a domain-agnostic framework that predicts dense depth images from a single sparse depth image and an RGB image without relying on ground-truth depth, by leveraging the power of large MDE models. First, we employ a pre-trained MDE model to produce relative depth images. These images are segmented and randomly rescaled to form synthetic pairs of dense pseudo-ground-truth and corresponding sparse depth maps. A refinement network is trained on the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model's accuracy and robustness. StarryGazer shows superior results over existing unsupervised methods and affine-transformed MDE results on various datasets, demonstrating that our framework exploits the power of MDE models while correcting their errors using sparse depth information.
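The synthetic-pair generation described above (segment the MDE relative depth, randomly rescale each segment, then subsample sparse points) might be sketched as follows. This is a minimal illustration, not the paper's implementation: the quantile-based segmentation, the affine parameter ranges, and all function names here are assumptions.

```python
import numpy as np

def make_synthetic_pair(rel_depth, num_segments=8, sparsity=0.01, rng=None):
    """Hypothetical sketch: turn an MDE relative depth map into a
    (dense pseudo-ground-truth, sparse depth) training pair.

    rel_depth: HxW float array of relative depth from a pre-trained MDE model.
    """
    rng = np.random.default_rng(rng)
    h, w = rel_depth.shape

    # Crude segmentation by depth quantiles (a stand-in for the paper's
    # segmentation of the relative depth map).
    edges = np.quantile(rel_depth, np.linspace(0.0, 1.0, num_segments + 1))
    labels = np.clip(np.searchsorted(edges, rel_depth, side="right") - 1,
                     0, num_segments - 1)

    # Random affine rescaling per segment: depth' = a * depth + b.
    # The parameter ranges below are illustrative, not from the paper.
    pseudo_gt = np.empty_like(rel_depth)
    for s in range(num_segments):
        mask = labels == s
        a = rng.uniform(0.5, 2.0)   # random scale (assumed range)
        b = rng.uniform(0.0, 1.0)   # random shift (assumed range)
        pseudo_gt[mask] = a * rel_depth[mask] + b

    # Subsample the pseudo ground truth into a sparse depth map.
    sparse = np.zeros_like(pseudo_gt)
    keep = rng.random((h, w)) < sparsity
    sparse[keep] = pseudo_gt[keep]
    return pseudo_gt, sparse
```

Because each segment gets its own affine parameters, the pseudo ground truth deliberately breaks the single global scale/shift relation to the relative depth, which is what a refinement network trained on such pairs must learn to correct from the sparse points.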