🤖 AI Summary
This paper addresses the coupled challenge of place recognition and 6-DoF pose estimation in LiDAR-visual cross-modal relocalization by proposing an end-to-end joint optimization framework. Methodologically: (1) a submap-guided cross-modal alignment strategy is introduced, combining stereo depth prediction with probabilistic occupancy grids that fuse several views from a local window to extend the camera’s limited field of view; (2) a flexible definition of positive samples lets place recognition and registration be optimized under a single learning objective; (3) a differentiable least-squares solver, weighted by the estimated inlier confidence of each correspondence, is adopted in lieu of RANSAC, improving registration robustness under low inlier ratios. Evaluated on KITTI and KITTI360, the method achieves state-of-the-art performance, particularly improving pose accuracy when the query lies far from the retrieved place. To our knowledge, this is the first work to realize highly robust and accurate end-to-end cross-modal relocalization.
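As a rough illustration of how several views might be fused probabilistically, the sketch below applies the standard log-odds Bayesian update to a single grid cell. This is an assumption for illustration only: the source does not specify the update rule used, and the function names `logit` and `fuse_occupancy` are hypothetical.

```python
import math


def logit(p):
    """Convert a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


def fuse_occupancy(observations, prior=0.5):
    """Fuse per-view occupancy probabilities for one cell.

    Assumes the observations are conditionally independent, which is the
    usual simplification in log-odds occupancy mapping (hypothetical here,
    not confirmed as the paper's formulation).
    """
    log_odds = logit(prior)
    for p in observations:
        # Each view adds its evidence relative to the prior.
        log_odds += logit(p) - logit(prior)
    # Convert the accumulated log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Under this rule, two views that each report a cell as 70% occupied reinforce one another, while one "occupied" and one equally confident "free" observation cancel back to the prior.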
📝 Abstract
This paper proposes SOLVR, a unified pipeline for learning-based LiDAR-Visual re-localisation which performs place recognition and 6-DoF registration across sensor modalities. We propose a strategy to align the input sensor modalities: stereo image streams are used to produce metric depth predictions with pose information, and multiple scene views from a local window are then fused using a probabilistic occupancy framework to expand the limited field-of-view of the camera. Additionally, SOLVR adopts a flexible definition of what constitutes a positive example for each training loss, allowing us to simultaneously optimise place recognition and registration performance. Furthermore, we replace RANSAC with a registration function that weights a simple least-squares fit by the estimated inlier likelihood of sparse keypoint correspondences, improving performance in scenarios with a low inlier ratio between the query and retrieved place. Our experiments on the KITTI and KITTI360 datasets show that SOLVR achieves state-of-the-art performance for LiDAR-Visual place recognition and registration, particularly improving registration accuracy over larger distances between the query and retrieved place.
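The idea of weighting a least-squares fit by inlier likelihood can be sketched with the classical weighted Kabsch (SVD) solution for a rigid transform. This is a minimal NumPy sketch under stated assumptions, not SOLVR's implementation: the function name `weighted_rigid_fit` is hypothetical, and in the paper the weights would come from a learned inlier estimator rather than being supplied by hand.

```python
import numpy as np


def weighted_rigid_fit(src, dst, weights):
    """Return (R, t) minimising sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding keypoints.
    weights:  (N,) non-negative inlier likelihoods (assumed given).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to one

    # Weighted centroids of both point sets.
    src_mean = w @ src
    dst_mean = w @ dst

    # Weighted cross-covariance of the centred points.
    H = ((src - src_mean) * w[:, None]).T @ (dst - dst_mean)

    # Optimal rotation from the SVD, with a reflection guard.
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation that aligns the weighted centroids.
    t = dst_mean - R @ src_mean
    return R, t
```

Because the solution is closed-form, correspondences with near-zero weight contribute almost nothing to the estimate, which is one way a fit like this can remain stable at low inlier ratios without RANSAC-style sampling.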