🤖 AI Summary
To address weak cross-modal alignment between 2D images and 3D point clouds in autonomous driving, this paper proposes NCLR, a self-supervised framework introducing the novel pretraining task of “2D–3D neural calibration”, which jointly optimizes cross-modal feature alignment and rigid pose estimation. Methodologically, NCLR employs a learnable geometric transformation module to unify image and point cloud feature spaces, establishing dense pixel-to-point correspondences for fine-grained matching and joint global pose modeling. Compared to existing self-supervised approaches, NCLR achieves significant performance gains on downstream 3D perception tasks—including LiDAR semantic segmentation, 3D object detection, and panoptic segmentation—demonstrating that joint cross-modal representation learning substantially enhances 3D understanding. The framework establishes a new paradigm for unsupervised multi-sensor fusion, circumventing reliance on costly annotated 3D data while improving geometric consistency and semantic coherence across modalities.
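The alignment step described above maps image and point-cloud features into a unified space where pixels and points can be compared directly. As a minimal illustrative sketch of the dense pixel-to-point matching only (the function name, cosine similarity, and softmax temperature are our assumptions, not the paper's implementation):

```python
import numpy as np

def dense_correspondences(img_feats, pt_feats, temperature=0.07):
    """Match each 3D point to its most similar pixel by cosine similarity.

    img_feats: (H*W, C) per-pixel features in the unified space.
    pt_feats:  (N, C) per-point features in the same space.
    Returns a soft assignment (N, H*W) and the hard pixel index per point.
    """
    f_img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    f_pt = pt_feats / np.linalg.norm(pt_feats, axis=1, keepdims=True)
    sim = f_pt @ f_img.T                         # (N, H*W) cosine similarities
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    soft = np.exp(logits)
    soft /= soft.sum(axis=1, keepdims=True)      # row-wise softmax over pixels
    return soft, sim.argmax(axis=1)
```

In practice the paper learns this transformation end-to-end; the sketch only shows how a shared feature space turns matching into a similarity lookup.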
📝 Abstract
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes. Specifically, our approach, NCLR, centers on 2D-3D neural calibration, a novel pretext task that estimates the rigid pose aligning the camera and LiDAR coordinate systems. First, we propose learnable transformation alignment to bridge the domain gap between image and point cloud data, converting features into a unified representation space for effective comparison and matching. Second, we identify the overlapping area between the image and the point cloud using the fused features. Third, we establish dense 2D-3D correspondences to estimate the rigid pose. The framework not only learns fine-grained matching from points to pixels but also aligns the image and point cloud at a holistic level, understanding their relative pose. We demonstrate the efficacy of NCLR by applying the pre-trained backbone to downstream tasks such as LiDAR-based 3D semantic segmentation, object detection, and panoptic segmentation. Comprehensive experiments on various datasets illustrate the superiority of NCLR over existing self-supervised methods. The results confirm that joint learning from different modalities significantly enhances the network's understanding ability and the effectiveness of the learned representations. The code is publicly available at https://github.com/Eaphan/NCLR.
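The rigid pose in the third step is recovered from dense 2D-3D correspondences (a PnP-style problem in the paper). As a simplified, hedged illustration, assuming the matched pixels have already been lifted to 3D (e.g., via depth), the closed-form Kabsch/Procrustes solve for a rigid transform looks like (the `kabsch` helper is ours, not from the paper):

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~ R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t
```

This closed-form step conveys why dense correspondences suffice to supervise a global pose: once matches are fixed, the optimal rotation and translation follow analytically from an SVD.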