🤖 AI Summary
To address challenges in visual–LiDAR odometry—including sensor misalignment, insufficient temporal information exploitation, and long-sequence error accumulation—this paper proposes a sparse spatiotemporal fusion framework for deep V-LiDAR odometry. The method integrates deep learning with geometric modeling to jointly optimize pose estimation across modalities and time. Key contributions include: (1) a sparse LiDAR query fusion mechanism enabling efficient cross-modal alignment; (2) a temporal interaction update module coupled with prediction-based initialization to enhance inter-frame consistency; and (3) temporal segment-wise training with collective average loss, facilitating global optimization over multiple frames and mitigating scale drift. Evaluated on KITTI and Argoverse benchmarks, the approach achieves state-of-the-art accuracy, significantly reducing pose estimation errors while maintaining real-time inference at 82 ms per frame.
📝 Abstract
Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally predicted positions with current-frame data, providing better initialization values for pose estimation and enhancing the model's robustness against accumulated errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing scale drift over long sequences. Extensive experiments on the KITTI and Argoverse odometry datasets demonstrate the superiority of our proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method is highly efficient, with an inference time of 82 ms per frame, making it suitable for real-time deployment.
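To make the Collective Average Loss idea concrete, the sketch below shows one plausible reading: per-frame pose losses (translation plus weighted rotation error) are averaged over a temporal clip so that every frame contributes equally to a single global objective. This is a minimal illustration only; the per-frame loss form, the quaternion rotation parameterization, and the weight `alpha` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def frame_pose_loss(t_pred, t_gt, q_pred, q_gt, alpha=1.0):
    """Per-frame pose loss: translation L2 error plus a weighted
    quaternion rotation error. The specific form and the weight
    `alpha` are illustrative assumptions, not DVLO4D's exact loss."""
    t_err = np.linalg.norm(t_pred - t_gt)
    q_err = np.linalg.norm(q_pred / np.linalg.norm(q_pred)
                           - q_gt / np.linalg.norm(q_gt))
    return t_err + alpha * q_err

def collective_average_loss(preds, gts, alpha=1.0):
    """Average the per-frame losses across all frames of a temporal
    clip, so gradients reflect the whole clip rather than a single
    frame -- the intuition behind clip-wise global optimization."""
    losses = [frame_pose_loss(tp, tg, qp, qg, alpha)
              for (tp, qp), (tg, qg) in zip(preds, gts)]
    return float(np.mean(losses))
```

In a real training loop this scalar would be computed on framework tensors (e.g. PyTorch) so a single backward pass propagates through every frame in the clip, which is what counters per-frame scale drift.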