Distilling Monocular Foundation Model for Fine-grained Depth Completion

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses dense depth map prediction from sparse LiDAR inputs, particularly tackling the challenge of recovering fine geometric details in the absence of dense depth annotations. To this end, the authors propose a two-stage knowledge distillation framework. Methodologically: (1) they pioneer the use of monocular foundation models (e.g., MiDaS) to provide geometric priors and dense supervision; (2) they design a LiDAR-simulation data generation strategy to bridge the domain gap between the monocular and sparse LiDAR modalities; and (3) they introduce a scale- and shift-invariant (SSI) loss to explicitly handle the inherent scale ambiguity of monocular depth estimation. The model is first pre-trained on simulated data, without requiring ground-truth depth, and then fine-tuned on real-world sparse LiDAR inputs. Evaluated on the KITTI depth completion benchmark, the approach achieves state-of-the-art performance, ranking first.

📝 Abstract
Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images, which distills geometric knowledge to depth completion. Specifically, we simulate LiDAR scans by utilizing monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Moreover, monocular depth estimation suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results demonstrate that models trained with our two-stage distillation framework achieve state-of-the-art performance, ranking first place on the KITTI benchmark. Code is available at https://github.com/Sharpiless/DMD3C
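The LiDAR-simulation idea in the abstract (creating sparse training inputs from monocular depth, without ground-truth annotations) can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the paper's actual pipeline: it mimics a spinning LiDAR's discrete elevation beams by keeping a fixed number of evenly spaced rows of a dense depth map and randomly dropping points along each row; the real method additionally uses mesh reconstruction and proper scan geometry.

```python
import numpy as np

def simulate_sparse_lidar(dense_depth, num_lines=64, keep_prob=0.9, seed=0):
    """Illustrative LiDAR simulation: sample `num_lines` evenly spaced rows
    (standing in for elevation beams) and randomly drop points per row.

    dense_depth : (H, W) array of dense depth values (e.g. from a monocular model)
    returns     : (H, W) array, zero where no simulated return exists
    """
    rng = np.random.default_rng(seed)
    h, w = dense_depth.shape
    sparse = np.zeros_like(dense_depth)
    rows = np.linspace(0, h - 1, num_lines).astype(int)
    for r in rows:
        keep = rng.random(w) < keep_prob  # random dropout along the scanline
        sparse[r, keep] = dense_depth[r, keep]
    return sparse
```

Pairing the dense monocular depth (as supervision target) with this sparse version (as input) yields training tuples in the spirit of the first distillation stage.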
Problem

Research questions and friction points this paper is trying to address.

Predict dense depth maps from sparse LiDAR inputs
Overcome sparse depth annotation limitations with dense supervision
Resolve monocular depth scale ambiguity in real-world settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage knowledge distillation framework
Pre-training with monocular depth simulation
Scale-invariant loss for real-world fine-tuning
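The scale- and shift-invariant loss named above follows the standard SSI formulation (as popularized by MiDaS): before comparing prediction and target, a per-image scale and shift are fitted by least squares so that only affine-invariant depth structure is penalized. A minimal NumPy sketch, assuming an L1 residual after alignment (the paper's exact residual term may differ):

```python
import numpy as np

def ssi_loss(pred, target, mask):
    """Scale- and shift-invariant loss (sketch).

    Fits scale s and shift b minimizing ||s*pred + b - target||^2 over valid
    pixels, then returns the mean absolute error of the aligned prediction.
    """
    p = pred[mask].ravel()
    t = target[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)  # [pred, 1] design matrix
    (s, b), *_ = np.linalg.lstsq(A, t, rcond=None)
    return np.mean(np.abs(s * p + b - t))
```

Because the alignment absorbs any affine ambiguity, a prediction that is correct up to scale and shift incurs zero loss, which is exactly why this objective suits distilling from scale-ambiguous monocular depth.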
Yingping Liang
2022–2028 Ph.D. student at Beijing Institute of Technology
3D Vision
Yutao Hu
Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University
Wenqi Shao
Researcher at Shanghai AI Laboratory
Foundation Model Evaluation, LLM Compression, Efficient Adaptation, Multimodal Learning
Ying Fu
Beijing Institute of Technology