🤖 AI Summary
Existing methods struggle to achieve high-precision joint calibration of LiDAR, RGB, and event camera modalities without calibration targets. This work proposes the first end-to-end, target-free learning framework that jointly calibrates all three modalities. It introduces a shared LiDAR representation that fuses features from the 3D point cloud and its projected depth map, modeling geometric consistency uniformly across modalities while cutting redundant computation. Evaluated on the KITTI and DSEC datasets, the method matches the accuracy of state-of-the-art bi-modal calibration approaches while enabling, for the first time, target-free joint calibration of all three modalities, setting a strong baseline for tri-modal calibration in autonomous perception systems.
📝 Abstract
Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, high-accuracy calibration directly from natural driving scenes (i.e., target-free calibration) is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, covering LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both the raw 3D point cloud and its projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets such as KITTI and DSEC, our LiREC-Net achieves performance competitive with bi-modal models and sets a strong new baseline for the tri-modal use case.
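To make the shared-representation idea concrete, below is a minimal sketch of how a dual-branch LiDAR encoder of this kind could look in PyTorch. Everything here is an assumption for illustration: the module names, layer sizes, and fusion strategy are not taken from the paper, only the high-level idea of encoding the LiDAR data once (from both its point cloud and its projected depth map) and reusing that feature across modality pairs.

```python
# Illustrative sketch of a shared LiDAR representation: one encoder
# fuses point-cloud and depth-map features, and the result is reused
# by every modality-pair calibration head. All names and dimensions
# are assumptions, not the architecture from the paper.
import torch
import torch.nn as nn


class SharedLiDARRepresentation(nn.Module):
    """Fuse features from the raw 3D point cloud and its projected
    depth map into a single representation shared across modalities."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Point branch: a PointNet-like per-point MLP + global max pool.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Depth branch: a small CNN over the projected depth map.
        self.depth_cnn = nn.Sequential(
            nn.Conv2d(1, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fuse the two global descriptors into one shared feature.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, points: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) raw LiDAR points; depth: (B, 1, H, W) projection.
        point_feat = self.point_mlp(points).max(dim=1).values  # (B, feat_dim)
        depth_feat = self.depth_cnn(depth).flatten(1)          # (B, feat_dim)
        return self.fuse(torch.cat([point_feat, depth_feat], dim=1))


if __name__ == "__main__":
    encoder = SharedLiDARRepresentation()
    pts = torch.randn(2, 4096, 3)      # dummy point clouds
    dep = torch.randn(2, 1, 128, 256)  # dummy projected depth maps
    shared = encoder(pts, dep)         # (2, 64): computed once, reusable
    print(shared.shape)
```

The point of the sketch is the data flow, not the layers: the LiDAR feature is computed once and then consumed by each modality-pair head (e.g., LiDAR-RGB and LiDAR-event extrinsic regression), which is where the claimed savings in redundant computation would come from.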