🤖 AI Summary
To address the high annotation cost and heavy reliance on manual labels in LiDAR-based autonomous driving perception, this paper proposes LaserMix++, a semi-supervised framework for 3D scene understanding. Methodologically, it extends the LaserMix data augmentation to the multi-modal setting, mixing laser beams across LiDAR scans while preserving LiDAR-camera correspondences for fine-grained cross-sensor interaction, and combines this with camera-to-LiDAR feature distillation and language-driven auxiliary supervision from open-vocabulary models to strengthen 3D scene consistency regularization. Evaluated on benchmarks including nuScenes, LaserMix++ achieves accuracy comparable to fully supervised training with only 20% of the labels, significantly outperforming existing semi-supervised approaches. By injecting language-driven priors into multi-modal semi-supervised 3D understanding, the work empirically validates gains in annotation efficiency, generalization, and geometric consistency modeling in 3D space.
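The core mixing operation partitions each scan by the laser beams' inclination (pitch) angle and swaps alternating angular areas between two scans, so points from different scenes are combined into one spatially coherent layout. Below is a minimal NumPy sketch of this single-modality step; the function name, the six-area default, and the nuScenes-like `pitch_range` are illustrative assumptions, not the authors' released code:

```python
import numpy as np

def lasermix(points_a: np.ndarray, points_b: np.ndarray,
             num_areas: int = 6,
             pitch_range: tuple = (-25.0, 3.0)) -> tuple:
    """Mix two LiDAR scans by swapping alternating inclination-angle areas.

    points_*: (N, 4) arrays of [x, y, z, intensity].
    pitch_range: assumed vertical field of view in degrees.
    Returns two mixed scans whose areas alternate between the inputs.
    """
    lo, hi = np.deg2rad(pitch_range[0]), np.deg2rad(pitch_range[1])
    edges = np.linspace(lo, hi, num_areas + 1)

    def area_index(pts):
        # Inclination (pitch) angle of each point w.r.t. the sensor origin.
        pitch = np.arctan2(pts[:, 2], np.linalg.norm(pts[:, :2], axis=1))
        return np.digitize(np.clip(pitch, lo, hi - 1e-6), edges) - 1

    idx_a, idx_b = area_index(points_a), area_index(points_b)
    even_a, even_b = idx_a % 2 == 0, idx_b % 2 == 0

    # Mixed scan 1 takes even areas from A and odd areas from B; scan 2 the reverse.
    mixed_1 = np.concatenate([points_a[even_a], points_b[~even_b]])
    mixed_2 = np.concatenate([points_b[even_b], points_a[~even_a]])
    return mixed_1, mixed_2
```

Semantic labels (and, in the multi-modal variant, the paired camera pixels) would be swapped with the same area masks, which is what keeps LiDAR-camera correspondences intact after mixing.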
📝 Abstract
Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
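To make the other two training signals concrete, here is a hedged PyTorch sketch of (a) camera-to-LiDAR feature distillation on points that project into an image and (b) the pseudo-label consistency term commonly used in such semi-supervised setups. The mean-teacher arrangement, the cosine-distance form of the distillation loss, the 0.9 confidence threshold, and all identifiers are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def distill_camera_to_lidar(lidar_feats: torch.Tensor,
                            cam_feats: torch.Tensor,
                            valid_mask: torch.Tensor) -> torch.Tensor:
    """Camera-to-LiDAR feature distillation on projectable points.

    lidar_feats: (N, C) per-point features from the LiDAR branch.
    cam_feats:   (N, C) image features sampled at each point's 2D projection.
    valid_mask:  (N,) bool, True where a point projects into a camera image.
    """
    a = F.normalize(lidar_feats[valid_mask], dim=-1)
    b = F.normalize(cam_feats[valid_mask], dim=-1).detach()  # camera branch as teacher
    return (1.0 - (a * b).sum(dim=-1)).mean()  # cosine-distance distillation loss

def consistency_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     conf_thresh: float = 0.9) -> torch.Tensor:
    """Consistency regularization on unlabeled (mixed) scans: the teacher's
    confident per-point predictions serve as pseudo-labels for the student."""
    probs = teacher_logits.softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    keep = conf > conf_thresh
    if keep.sum() == 0:
        return student_logits.sum() * 0.0  # keep the graph; no confident points
    return F.cross_entropy(student_logits[keep], pseudo[keep])
```

Detaching the camera features treats the image branch as a fixed teacher, so the distillation gradient only shapes the LiDAR representation; this matches the camera-to-LiDAR direction described in the abstract.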