CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of cross-modal knowledge transfer from 2D vision foundation models to 3D LiDAR models—namely, reliance on pseudo-labels or hand-crafted losses—we propose a lightweight, self-supervised cross-modal distillation framework. Our method introduces (1) a novel distillation loss based directly on feature similarity, coupled with a lightweight MLP projection head, eliminating the need for pseudo semantic maps and manually designed loss functions; and (2) occupancy prediction as an auxiliary spatial task to jointly enhance semantic understanding and 3D geometric reasoning. Evaluated on autonomous driving multi-task benchmarks—including LiDAR semantic segmentation and 3D object detection—our approach achieves state-of-the-art performance, with up to a 10% improvement in mIoU. Notably, it maintains strong generalization even under extremely low-data fine-tuning regimes, demonstrating robustness and practical applicability.

📝 Abstract
Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, on pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices: unlike contrastive approaches that rely on complex loss formulations, our method employs a direct feature similarity loss in combination with a multilayer perceptron (MLP) projection head, allowing the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD), improving mIoU by up to 10%, especially when fine-tuning on very small amounts of data, showing the effectiveness of our simple yet powerful KD strategy.
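The distillation objective described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the two-layer MLP shape, the function names, and the use of cosine similarity as the "direct feature similarity" measure are all assumptions made for the example.

```python
import numpy as np

def mlp_projection(feats, w1, b1, w2, b2):
    """Project 3D backbone features into the 2D teacher's feature space.
    A two-layer MLP with ReLU is an assumption; the paper only states
    that a lightweight MLP projection head is used."""
    h = np.maximum(feats @ w1 + b1, 0.0)  # hidden layer with ReLU
    return h @ w2 + b2

def similarity_distill_loss(student_feats, teacher_feats, eps=1e-8):
    """Direct feature-similarity distillation loss, sketched here as
    1 - mean cosine similarity between projected student features and
    frozen VFM teacher features at corresponding points. The cosine
    form is an assumption; zero loss means perfect alignment."""
    s = student_feats / (np.linalg.norm(student_feats, axis=-1, keepdims=True) + eps)
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + eps)
    return 1.0 - np.mean(np.sum(s * t, axis=-1))
```

The appeal of such a loss is that it needs no negative-pair mining or pseudo-semantic targets: the teacher features themselves are the supervision signal, and the MLP head absorbs the modality gap between the 3D and 2D feature spaces.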
Problem

Research questions and friction points this paper is trying to address.

Transfer 2D vision model capabilities to 3D LiDAR models
Simplify cross-modal distillation without complex losses or pseudo-semantic maps
Enhance 3D spatial reasoning with auxiliary occupancy prediction task
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct feature similarity loss with MLP projection
No reliance on pseudo-semantic maps
Auxiliary self-supervised spatial occupancy prediction
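The auxiliary occupancy-prediction task listed above is self-supervised because its targets come for free from the LiDAR sweep itself. A hypothetical sketch of how such targets might be built (the voxelization scheme and function name are assumptions; the paper only names occupancy prediction as the auxiliary task):

```python
import numpy as np

def occupancy_targets(points, grid_min, voxel_size, grid_shape):
    """Build a binary occupancy grid from raw LiDAR points to serve as
    a self-supervised target for the auxiliary spatial head.

    points:     (N, 3) array of LiDAR point coordinates
    grid_min:   (3,) lower corner of the voxel grid in metres
    voxel_size: edge length of a cubic voxel in metres
    grid_shape: (X, Y, Z) number of voxels per axis
    """
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    # keep only points that fall inside the grid
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=bool)
    grid[tuple(idx[valid].T)] = True  # mark voxels containing a point
    return grid
```

Training the 3D network to predict this grid alongside the distillation loss encourages it to keep geometric structure that a purely semantic objective might discard.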