Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection

📅 2024-10-01
🏛️ IEEE Transactions on Circuits and Systems for Video Technology
📈 Citations: 11
Influential: 0
🤖 AI Summary
Monocular 3D object detection is inherently ill-posed due to the absence of precise depth information, and existing cross-modal knowledge distillation approaches often suffer from negative transfer caused by the modality gap between images and LiDAR. To address this issue, this work proposes MonoSTL, which presents the first systematic analysis of negative transfer in cross-modal distillation and introduces two novel components: Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD). These modules leverage depth uncertainty to guide positive knowledge transfer and effectively integrate LiDAR-derived depth cues through structural alignment and selective distillation mechanisms. Extensive experiments demonstrate that MonoSTL significantly boosts the performance of various baseline models on both KITTI and NuScenes benchmarks, achieving state-of-the-art results and confirming its effectiveness and generalizability.

📝 Abstract
Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation can effectively transfer depth information from a LiDAR-based network to an image-based one. However, the modality gap between image and LiDAR seriously limits its accuracy. In this paper, we systematically investigate, for the first time, the negative transfer problem induced by the modality gap in cross-modality distillation, covering not only the architecture inconsistency issue but, more importantly, the feature overfitting issue. We propose a selective learning approach named MonoSTL to overcome these issues, which encourages positive transfer of depth information from LiDAR while alleviating negative transfer on the image-based network. On the one hand, we use similar architectures to ensure spatial alignment of features between the image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillation, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models; we validate it on three recent models on KITTI and one recent model on NuScenes. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy among all recently released SOTA models. The code is released at https://github.com/DingCodeLab/MonoSTL.
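The core idea of depth-aware selective distillation is that regions where the student's depth estimate is uncertain should contribute less to the cross-modal transfer, suppressing negative transfer from the modality gap. Below is a minimal illustrative sketch of that weighting scheme, not the paper's actual implementation: the function name, shapes, and the `exp(-uncertainty)` confidence mapping are all assumptions made for illustration.

```python
import numpy as np

def selective_feature_distill_loss(student_feat, teacher_feat, depth_uncertainty):
    """Illustrative sketch of depth-aware selective feature distillation.

    student_feat, teacher_feat: (C, H, W) feature maps from the image-based
    student and LiDAR-based teacher (assumed spatially aligned).
    depth_uncertainty: (H, W) per-location depth uncertainty of the student.

    Locations with high depth uncertainty get a small weight, so unreliable
    regions barely influence the distillation loss (mitigating negative
    transfer). Names and the exact weighting are hypothetical.
    """
    # Map uncertainty to a confidence weight in (0, 1]: low uncertainty -> ~1.
    confidence = np.exp(-depth_uncertainty)                       # (H, W)
    # Per-location squared feature discrepancy, averaged over channels.
    sq_err = np.mean((student_feat - teacher_feat) ** 2, axis=0)  # (H, W)
    # Confidence-weighted mean of the discrepancy.
    return float(np.sum(confidence * sq_err) / (np.sum(confidence) + 1e-8))
```

With this weighting, a region where the teacher and student disagree but depth is unreliable is largely ignored, whereas the same disagreement in a confidently estimated region drives the transfer.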
Problem

Research questions and friction points this paper is trying to address.

monocular 3D object detection
cross-modality distillation
modality gap
negative transfer
depth information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Transfer Learning
Cross-Modality Distillation
Depth-Aware Distillation
Negative Transfer Mitigation
Monocular 3D Object Detection