Self-Supervised Cross-Modal Learning for Image-to-Point Cloud Registration

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the semantic-geometric gap and the susceptibility to local optima in image-to-point-cloud (I2P) cross-modal registration, this paper proposes CrossI2P, an end-to-end self-supervised framework. The method constructs a geometric-semantic fused embedding space, enabling robust cross-modal feature alignment via dual-path contrastive learning and dynamic gradient normalization. It further adopts a two-stage, coarse-to-fine registration: global superpoint–superpixel matching followed by geometry-constrained point-level refinement. Crucially, CrossI2P eliminates reliance on manual annotations by combining self-supervised training, a cross-modal interaction network, and a balanced multi-loss optimization mechanism. Evaluated on the KITTI and nuScenes benchmarks, CrossI2P improves registration accuracy over prior methods by 23.7% and 37.9%, respectively, demonstrating substantial gains in robustness and generalization across diverse scenes and sensor configurations.
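The dual-path contrastive alignment described above can be sketched as a bidirectional InfoNCE loss over corresponding image and point-cloud embeddings. This is a minimal illustration, not the paper's implementation; the function name, the temperature value, and the assumption that row i of each tensor forms a positive pair are all ours.

```python
import torch
import torch.nn.functional as F

def dual_path_contrastive_loss(img_feats, pcd_feats, temperature=0.07):
    """Bidirectional InfoNCE loss over matched 2D/3D feature pairs.

    img_feats, pcd_feats: (N, D) embeddings of N corresponding
    superpixel/superpoint pairs; row i of each tensor is assumed
    to be a positive pair (an illustrative convention, not the
    paper's exact setup).
    """
    img = F.normalize(img_feats, dim=1)
    pcd = F.normalize(pcd_feats, dim=1)
    logits = img @ pcd.t() / temperature             # (N, N) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2p = F.cross_entropy(logits, targets)      # image -> point path
    loss_p2i = F.cross_entropy(logits.t(), targets)  # point -> image path
    return 0.5 * (loss_i2p + loss_p2i)
```

Pulling embeddings from both modalities toward their paired counterparts in both directions is what makes the alignment bidirectional and annotation-free: the positive pairs come from sensor correspondence, not human labels.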

📝 Abstract
Bridging 2D and 3D sensor modalities is critical for robust perception in autonomous systems. However, image-to-point cloud (I2P) registration remains challenging due to the semantic-geometric gap between texture-rich but depth-ambiguous images and sparse yet metrically precise point clouds, as well as the tendency of existing methods to converge to local optima. To overcome these limitations, we introduce CrossI2P, a self-supervised framework that unifies cross-modal learning and two-stage registration in a single end-to-end pipeline. First, we learn a geometric-semantic fused embedding space via dual-path contrastive learning, enabling annotation-free, bidirectional alignment of 2D textures and 3D structures. Second, we adopt a coarse-to-fine registration paradigm: a global stage establishes superpoint-superpixel correspondences through joint intra-modal context and cross-modal interaction modeling, followed by a geometry-constrained point-level refinement for precise registration. Third, we employ a dynamic training mechanism with gradient normalization to balance losses for feature alignment, correspondence refinement, and pose estimation. Extensive experiments demonstrate that CrossI2P outperforms state-of-the-art methods by 23.7% on the KITTI Odometry benchmark and by 37.9% on nuScenes, significantly improving both accuracy and robustness.
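The "dynamic training mechanism with gradient normalization" mentioned in the abstract can be illustrated with a simplified, GradNorm-inspired heuristic: rescale each task loss so that all tasks exert an equal gradient norm on the shared backbone. The helper below is a sketch under that assumption; the paper's exact weighting scheme is not reproduced here.

```python
import torch

def balance_losses(losses, shared_params, eps=1e-8):
    """Reweight task losses so each contributes an equal gradient norm
    on the shared parameters (a simplified GradNorm-style heuristic,
    not the paper's exact mechanism)."""
    grad_norms = []
    for loss in losses:
        # Gradient of this task's loss w.r.t. the shared backbone only.
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        grad_norms.append(torch.sqrt(sum((g ** 2).sum() for g in grads)))
    mean_norm = torch.stack(grad_norms).mean()
    # Down-weight tasks with large gradients, up-weight small ones.
    weights = [(mean_norm / (n + eps)).detach() for n in grad_norms]
    total = sum(w * l for w, l in zip(weights, losses))
    return total, [w.item() for w in weights]
```

Detaching the weights keeps the balancing itself out of the backward pass, so each step optimizes the reweighted sum of the feature-alignment, correspondence, and pose losses rather than the weights.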
Problem

Research questions and friction points this paper is trying to address.

Bridging 2D images and 3D point clouds for registration
Overcoming the semantic-geometric gap between cross-modal data
Addressing convergence to local optima in image-to-point-cloud alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised cross-modal learning framework
Dual-path contrastive embedding alignment
Coarse-to-fine two-stage registration paradigm
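The coarse stage of the two-stage paradigm can be illustrated with a mutual-nearest-neighbour matcher over superpixel and superpoint embeddings in the fused space. This is a minimal stand-in for the paper's learned matcher, which additionally models intra-modal context and cross-modal interaction; the function name and matching rule are assumptions.

```python
import torch

def mutual_nearest_matches(sp_img, sp_pcd):
    """Coarse correspondences: keep (superpixel, superpoint) pairs that are
    mutual nearest neighbours by cosine similarity in the shared embedding
    space (an illustrative rule, not the paper's learned matcher)."""
    sim = torch.nn.functional.normalize(sp_img, dim=1) @ \
          torch.nn.functional.normalize(sp_pcd, dim=1).t()
    i2p = sim.argmax(dim=1)  # best superpoint for each superpixel
    p2i = sim.argmax(dim=0)  # best superpixel for each superpoint
    return [(i, j.item()) for i, j in enumerate(i2p) if p2i[j] == i]
```

The fine stage would then refine these coarse pairs to point-level correspondences under geometric constraints before estimating the pose.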