MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection

📅 2024-04-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Monocular 3D detection suffers from depth ambiguity, rendering conventional cross-modal knowledge distillation (LiDAR teacher → monocular student) ineffective. To address this, we propose a monocular teaching-assistant knowledge distillation framework that introduces a residual spatial cue-based teaching assistant model. This assistant explicitly models geometric discrepancies between the LiDAR teacher and monocular student in 3D space, enabling effective transfer of depth-aware perception under a purely visual paradigm for the first time. Our method comprises three key components: (1) multi-stage knowledge distillation; (2) 3D spatial residual feature modeling; and (3) monocular depth-geometric consistency constraints coupled with cross-modal feature alignment. Evaluated on KITTI 3D, our approach achieves state-of-the-art performance. It further generalizes robustly to multi-view nuScenes and unsupervised KITTI raw datasets, delivering significant improvements in both accuracy and robustness.
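The residual-cue idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: feature maps are random arrays standing in for BEV features, and the function name, shapes, and the simple L2 losses are all hypothetical. It shows the core construction the summary describes: spatial cues defined as the residual between LiDAR-teacher and camera teaching-assistant (TA) features, distilled into the monocular student alongside an intra-modal term.

```python
import numpy as np

def monotakd_losses(teacher, ta, student):
    """Illustrative sketch (not the paper's API): residual spatial cues are
    the difference between LiDAR teacher and camera TA features; the student
    is pulled toward the TA and toward reproducing those cues."""
    spatial_cues = teacher - ta                              # geometry the camera TA lacks
    l_intra = np.mean((student - ta) ** 2)                   # camera-to-camera distillation
    l_cues = np.mean(((student - ta) - spatial_cues) ** 2)   # residual-cue transfer
    return float(l_intra), float(l_cues)

# Toy BEV-style feature maps (channels, H, W); shapes are illustrative only.
rng = np.random.default_rng(0)
t, a, s = (rng.normal(size=(8, 4, 4)) for _ in range(3))
li, lc = monotakd_losses(t, a, s)
```

A student that exactly reproduces the teacher's features drives the residual-cue term to zero, which is the sanity check this construction is meant to satisfy.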

📝 Abstract
Monocular 3D object detection (Mono3D) holds noteworthy promise for autonomous driving applications owing to the cost-effectiveness and rich visual context of monocular camera sensors. However, depth ambiguity poses a significant challenge, as it requires extracting precise 3D scene geometry from a single image, resulting in suboptimal performance when transferring knowledge from a LiDAR-based teacher model to a camera-based student model. To address this issue, we introduce Monocular Teaching Assistant Knowledge Distillation (MonoTAKD) to enhance 3D perception in Mono3D. Our approach presents a robust camera-based teaching assistant model that effectively bridges the representation gap between the teacher and student modalities, addressing the challenge of inaccurate depth estimation. By defining 3D spatial cues as residual features that capture the differences between the teacher and the teaching assistant models, we transfer these cues to the student model, improving its 3D perception capabilities. Experimental results show that MonoTAKD achieves state-of-the-art performance on the KITTI3D dataset. Additionally, we evaluate performance on the nuScenes and KITTI raw datasets to demonstrate that our model generalizes to multi-view 3D and unsupervised data settings. Our code will be available at https://github.com/hoiliu-0801/MonoTAKD.
Problem

Research questions and friction points this paper is trying to address.

Monocular 3D object detection
Depth ambiguity challenge
Knowledge distillation between modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Camera-based teaching assistant model
Residual features for depth estimation
State-of-the-art 3D perception enhancement
👥 Authors
Hou-I Liu, NYCU (Computer Vision)
Christine Wu, University of Washington, Seattle, WA 98195, USA
Jen-Hao Cheng, University of Washington, Seattle, WA 98195, USA
Wenhao Chai, Princeton University (Machine Learning, Computer Vision)
Shian-Yun Wang, University of Southern California, Los Angeles, CA 90007, USA
Gaowen Liu, Cisco Research (Machine Learning, Computer Vision, Multimedia)
Jenq-Neng Hwang, University of Washington, Seattle, WA 98195, USA
Hong-Han Shuai, National Yang Ming Chiao Tung University (Deep Learning, Data Mining, Multimedia Processing)
Wen-Huang Cheng, Professor, IEEE Fellow, National Taiwan University (Artificial Intelligence, Multimedia, Computer Vision, Machine Learning)