🤖 AI Summary
This work addresses the challenge of establishing reliable cross-modal correspondences between camera and LiDAR data under large initial extrinsic calibration errors. The authors propose an extrinsic-aware cross-attention framework that explicitly embeds extrinsic hypotheses into the cross-attention mechanism, directly aligning image patches and LiDAR point clusters in their native domains. By avoiding depth-map projection, the method circumvents the geometric distortions that 2D projections introduce and achieves geometrically consistent cross-modal matching. Through native-domain feature alignment and parameterized extrinsic modeling, the approach significantly improves calibration robustness under large initial misalignments. Experiments show calibration accuracies of 88% on KITTI and 99% on nuScenes, substantially outperforming existing state-of-the-art techniques in large-error scenarios.
📝 Abstract
Accurate camera-LiDAR fusion relies on precise extrinsic calibration, which fundamentally depends on establishing reliable cross-modal correspondences under potentially large misalignments. Existing learning-based methods typically project LiDAR points into depth maps for feature fusion, which distorts 3D geometry and degrades performance when the extrinsic initialization is far from the ground truth. To address this issue, we propose an extrinsic-aware cross-attention framework that directly aligns image patches and LiDAR point groups in their native domains. The proposed attention mechanism explicitly injects extrinsic parameter hypotheses into the correspondence modeling process, enabling geometry-consistent cross-modal interaction without relying on projected 2D depth maps. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both accuracy and robustness. Under large extrinsic perturbations, our approach achieves accurate calibration in 88% of KITTI cases and 99% of nuScenes cases, substantially surpassing the second-best baseline. Our code is open-sourced at https://github.com/gitouni/ProjFusion to benefit the community.
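To make the core idea concrete, the sketch below illustrates one plausible way to inject a 6-DoF extrinsic hypothesis into cross-attention between image-patch queries and LiDAR point-group keys. This is a minimal illustrative sketch, not the paper's implementation: the function name, the random placeholder weights, the hidden dimension, and the simple additive conditioning of queries and keys on the hypothesis embedding are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extrinsic_aware_cross_attention(img_feats, pts_feats, extrinsic, d=32, seed=0):
    """Hypothetical sketch of extrinsic-aware cross-attention.

    img_feats : (n_patches, c_img)  image-patch features (queries)
    pts_feats : (n_groups, c_pts)   LiDAR point-group features (keys/values)
    extrinsic : (6,)                extrinsic hypothesis (3 rotation + 3 translation)

    The hypothesis is embedded and added to both queries and keys, so the
    attention scores (correspondence likelihoods) depend on the candidate
    extrinsics. All projection weights here are random placeholders.
    """
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((img_feats.shape[1], d)) / np.sqrt(d)
    Wk = rng.standard_normal((pts_feats.shape[1], d)) / np.sqrt(d)
    Wv = rng.standard_normal((pts_feats.shape[1], d)) / np.sqrt(d)
    We = rng.standard_normal((6, d)) / np.sqrt(d)   # extrinsic embedding matrix

    e = extrinsic @ We                 # (d,) embedded extrinsic hypothesis
    Q = img_feats @ Wq + e             # condition queries on the hypothesis
    K = pts_feats @ Wk + e             # ...and keys
    A = softmax(Q @ K.T / np.sqrt(d))  # (n_patches, n_groups) correspondence weights
    return A @ (pts_feats @ Wv), A     # fused features, attention map
```

In this reading, evaluating the attention map under different extrinsic hypotheses lets the network score how well each candidate calibration explains the observed image/point-cloud pair, without ever projecting the points into a 2D depth map.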