🤖 AI Summary
Monocular 3D detection suffers from depth ambiguity, rendering conventional cross-modal knowledge distillation (LiDAR teacher → monocular student) ineffective. To address this, we propose a monocular teaching-assistant knowledge distillation framework that introduces a residual spatial cue-based teaching assistant model. This assistant explicitly models geometric discrepancies between the LiDAR teacher and monocular student in 3D space, enabling effective transfer of depth-aware perception under a purely visual paradigm for the first time. Our method comprises three key components: (1) multi-stage knowledge distillation; (2) 3D spatial residual feature modeling; and (3) monocular depth-geometric consistency constraints coupled with cross-modal feature alignment. Evaluated on KITTI 3D, our approach achieves state-of-the-art performance. It further generalizes to the multi-view nuScenes and unsupervised KITTI raw datasets, delivering significant improvements in both accuracy and robustness.
📝 Abstract
Monocular 3D object detection (Mono3D) holds noteworthy promise for autonomous driving applications owing to the cost-effectiveness and rich visual context of monocular camera sensors. However, depth ambiguity poses a significant challenge: precise 3D scene geometry must be inferred from a single image, which leads to suboptimal performance when knowledge is transferred from a LiDAR-based teacher model to a camera-based student model. To address this issue, we introduce *Monocular Teaching Assistant Knowledge Distillation (MonoTAKD)* to enhance 3D perception in Mono3D. Our approach presents a robust camera-based teaching assistant model that effectively bridges the representation gap between the teacher and student modalities, addressing the challenge of inaccurate depth estimation. By defining 3D spatial cues as residual features that capture the differences between the teacher and the teaching assistant models, we distill these cues into the student model, improving its 3D perception capabilities. Experimental results show that MonoTAKD achieves state-of-the-art performance on the KITTI 3D dataset. Additionally, we evaluate performance on the nuScenes and KITTI raw datasets to demonstrate that our model generalizes to multi-view 3D and unsupervised data settings. Our code will be available at https://github.com/hoiliu-0801/MonoTAKD.
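The residual-cue idea above can be sketched in a few lines: the residual features are the difference between the LiDAR teacher's features and the camera teaching assistant's features, and the student is supervised both to mimic the TA and to recover that residual. This is a minimal NumPy sketch, not the paper's implementation: the feature shapes, the L2 distillation losses, and the `student_residual` branch are illustrative assumptions.

```python
import numpy as np

def l2_distill_loss(pred_feat, target_feat):
    """Mean-squared-error feature distillation loss (illustrative choice)."""
    return float(np.mean((pred_feat - target_feat) ** 2))

def residual_spatial_cues(teacher_feat, ta_feat):
    """3D spatial cues as residual features: what the LiDAR teacher
    captures beyond the camera-based teaching assistant."""
    return teacher_feat - ta_feat

# Toy BEV-style feature maps (C, H, W) standing in for network outputs.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8, 8))            # LiDAR-based teacher
ta      = rng.normal(size=(4, 8, 8))            # camera-based teaching assistant
student = rng.normal(size=(4, 8, 8))            # monocular student features
student_residual = rng.normal(size=(4, 8, 8))   # hypothetical residual branch

# Intra-modal distillation: the student mimics the camera TA (same modality).
loss_feat = l2_distill_loss(student, ta)
# Residual distillation: the student's residual branch learns the
# teacher-minus-TA spatial cues instead of chasing the teacher directly.
loss_res = l2_distill_loss(student_residual, residual_spatial_cues(teacher, ta))
total_loss = loss_feat + loss_res
```

Distilling the residual rather than the raw LiDAR features means the student only has to close the modality gap that the camera TA cannot, which is the motivation for inserting the teaching assistant between teacher and student.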