🤖 AI Summary
Existing LiDAR-camera calibration methods rely on manually placed targets, preliminary parameter estimates, or intensive data preprocessing, which limits their scalability and online applicability. To address this, we propose CalibRefine, a fully automatic, targetless, online calibration framework that operates directly on raw LiDAR point clouds and camera images. It establishes cross-modal correspondences through object-level feature matching and proceeds in four stages: (1) a Common Feature Discriminator that matches automatically detected objects across the two sensors, (2) a coarse homography-based calibration, (3) an iterative refinement that improves alignment as more frames become available, and (4) a refinement based on a Vision Transformer with cross-attention that corrects non-planar distortions. The method requires no calibration targets, no initial parameter estimates, no ground-truth calibration matrices, and no elaborate preprocessing, and it needs only minimal human involvement. On two urban traffic datasets, it outperforms state-of-the-art targetless methods and is competitive with, or better than, manually tuned baselines.
📝 Abstract
Accurate multi-sensor calibration is essential for deploying robust perception systems in applications such as autonomous driving, robotics, and intelligent transportation. Existing LiDAR-camera calibration methods often rely on manually placed targets, preliminary parameter estimates, or intensive data preprocessing, limiting their scalability and adaptability in real-world settings. In this work, we propose CalibRefine, a fully automatic, targetless, and online calibration framework that directly processes raw LiDAR point clouds and camera images. Our approach has four stages: (1) a Common Feature Discriminator that trains on automatically detected objects (using relative positions, appearance embeddings, and semantic classes) to generate reliable LiDAR-camera correspondences, (2) a coarse homography-based calibration, (3) an iterative refinement that incrementally improves alignment as additional data frames become available, and (4) an attention-based refinement that addresses non-planar distortions by leveraging a Vision Transformer and cross-attention mechanisms. Through extensive experiments on two urban traffic datasets, we show that CalibRefine delivers high-precision calibration results with minimal human involvement, outperforming state-of-the-art targetless methods and remaining competitive with, or surpassing, manually tuned baselines. Our findings highlight how robust object-level feature matching, together with iterative and self-supervised attention-based adjustments, enables consistent sensor fusion in complex, real-world conditions without requiring ground-truth calibration matrices or elaborate data preprocessing.
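To make stages (2) and (3) more concrete, the following is a minimal sketch of how a homography could be fitted from matched LiDAR-camera object points and then refitted as more frames arrive. It is not the CalibRefine implementation: the abstract does not specify the estimator, so a plain least-squares fit with the last homography entry fixed to 1 is assumed here, and "iterative refinement" is approximated by refitting on all correspondences accumulated so far. The function names (`fit_homography`, `apply_homography`, `mean_reprojection_error`) and the sample coordinates are hypothetical, and the code uses only the Python standard library so that it stays self-contained.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate correspondences (for example, collinear points)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_homography(src: Sequence[Point], dst: Sequence[Point]) -> Matrix:
    """Least-squares 3x3 homography (h33 fixed to 1) from at least 4 point pairs."""
    if len(src) != len(dst) or len(src) < 4:
        raise ValueError("need at least 4 matched point pairs")
    rows: Matrix = []
    rhs: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1*x + h2*y + h3) / (h7*x + h8*y + 1), and likewise for v.
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    # Normal equations: (A^T A) h = A^T b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(8)]
    h = solve_linear(ata, atb) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(h: Matrix, p: Point) -> Point:
    """Map a 2D point through the homography and dehomogenize."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def mean_reprojection_error(h: Matrix, src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average Euclidean distance between mapped source points and their targets."""
    total = 0.0
    for p, (u, v) in zip(src, dst):
        pu, pv = apply_homography(h, p)
        total += math.hypot(pu - u, pv - v)
    return total / len(src)


if __name__ == "__main__":
    # Hypothetical matched object centers per frame: (LiDAR plane point, image pixel).
    true_h = [[1.2, 0.1, 30.0], [-0.05, 0.9, 12.0], [0.01, 0.02, 1.0]]
    frames = [
        [(0.0, 0.0), (10.0, 0.0), (10.0, 8.0), (0.0, 8.0)],
        [(3.0, 2.0), (7.0, 5.0), (2.0, 6.0), (8.0, 1.0)],
    ]
    all_src: List[Point] = []
    all_dst: List[Point] = []
    for index, frame in enumerate(frames):
        all_src.extend(frame)
        all_dst.extend(apply_homography(true_h, p) for p in frame)
        estimate = fit_homography(all_src, all_dst)  # refit on everything seen so far
        error = mean_reprojection_error(estimate, all_src, all_dst)
        print(f"after frame {index}: {len(all_src)} pairs, mean error {error:.2e} px")
```

A single homography can only model a planar mapping, which is consistent with the abstract's motivation for a further attention-based stage that handles non-planar distortions. In practice, a robust estimator (for example, one with outlier rejection) would likely replace the plain least-squares fit, since automatically generated correspondences can contain mismatches.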