🤖 AI Summary
Addressing the high computational cost and the inability to produce metric-scale depth in full-surround monocular depth estimation (FSMDE), this paper proposes a scale-consistency-oriented knowledge distillation framework. It introduces two distillation strategies: cross-interaction knowledge distillation to enhance scale invariance, and view-relational knowledge distillation to enforce multi-view depth consistency. It further integrates a classification-based depth binning scheme with a hybrid regression architecture to enable efficient distillation from a heavy teacher model to a lightweight student network. Evaluated on the DDAD and nuScenes benchmarks, the approach significantly outperforms both conventional supervised methods and state-of-the-art distillation techniques, achieving real-time inference speed while markedly improving metric depth accuracy. It is presented as the first FSMDE solution that jointly delivers high efficiency, strong scale consistency, and geometrically plausible multi-view depth.
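The hybrid regression idea above (classification-style depth bins combined with a regression readout, plus distillation of the teacher's bin probabilities) can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the bin count, the metric depth range, and the KL-divergence form of the distillation loss are all assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the bin dimension."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N_BINS = 8                                    # hypothetical number of depth bins
rng = np.random.default_rng(0)

student_logits = rng.normal(size=N_BINS)      # student's scale-invariant bin logits
teacher_logits = rng.normal(size=N_BINS)      # foundation-model (teacher) bin logits
bin_centers = np.linspace(0.5, 80.0, N_BINS)  # metric bin centers in meters (assumed
                                              # range; in the paper these are learned
                                              # under ground-truth depth supervision)

p_student = softmax(student_logits)
p_teacher = softmax(teacher_logits)

# Hybrid regression: final depth is the probability-weighted sum of bin centers,
# so classification over bins yields a continuous metric depth estimate.
depth = np.sum(p_student * bin_centers)

# Cross-interaction-style distillation: align the student's bin probabilities with
# the teacher's via KL divergence (a common KD loss; the paper's exact formulation
# may differ), while bin centers remain anchored to metric scale.
kd_loss = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
```

Because the scale-invariant knowledge (bin probabilities) and the metric scale (bin centers) are carried by separate factors, the teacher can supervise the former without contaminating the latter.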
📝 Abstract
Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework that combines a knowledge distillation scheme, traditionally used in classification, with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate that our method outperforms conventional supervised methods and existing knowledge distillation approaches. Moreover, it achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
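The view-relational distillation described above transfers the *structure* among camera views rather than raw features. A minimal sketch of one common way to realize this, assuming pooled per-view embeddings and a cosine-similarity relation matrix (the feature dimension, view count, and MSE matching loss are illustrative assumptions, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(1)
n_views, feat_dim = 6, 16   # e.g. 6 surround cameras; hypothetical feature size

# Pooled per-view global features from teacher and student networks (random
# stand-ins here; in practice these come from each model's encoder).
f_teacher = rng.normal(size=(n_views, feat_dim))
f_student = rng.normal(size=(n_views, feat_dim))

def relation_matrix(f):
    """Cosine similarity between every pair of camera-view embeddings."""
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    return f @ f.T

# Relational KD: match the student's inter-view similarity structure to the
# teacher's, encouraging consistent depth relationships across adjacent views.
rel_loss = np.mean((relation_matrix(f_teacher) - relation_matrix(f_student)) ** 2)
```

Matching pairwise relations instead of absolute features makes the loss invariant to per-view feature scaling, which suits transferring cross-view consistency from a much larger teacher.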