🤖 AI Summary
To address robust place recognition for autonomous driving under GPS-denied conditions, this paper proposes a multimodal scene representation method based on 3D Gaussian splatting. It is the first to jointly model multi-view RGB images and LiDAR point clouds as a spatio-temporally consistent, differentiable, explicit 3D Gaussian scene. By performing geometric alignment and interpretable fusion of cross-modal data directly in physical space, bypassing opaque feature-level fusion, the approach significantly enhances the transparency and generalizability of cross-modal correspondence. Integrated with 3D graph convolution and transformer architectures, it enables end-to-end differentiable rendering and place matching. Evaluated on three benchmark datasets, the method achieves state-of-the-art accuracy and demonstrates strong cross-scene generalization. The source code is publicly available.
📝 Abstract
Place recognition is a crucial component that enables autonomous vehicles to obtain localization results in GPS-denied environments. In recent years, multimodal place recognition methods have gained increasing attention. They overcome the weaknesses of unimodal sensor systems by leveraging complementary information from different modalities. However, most existing methods explore cross-modality correlations through feature-level or descriptor-level fusion, suffering from a lack of interpretability. Conversely, the recently proposed 3D Gaussian Splatting provides a new perspective on multimodal fusion by harmonizing different modalities into an explicit scene representation. In this paper, we propose a 3D Gaussian Splatting-based multimodal place recognition network dubbed GSPR. It explicitly combines multi-view RGB images and LiDAR point clouds into a spatio-temporally unified scene representation with the proposed Multimodal Gaussian Splatting. A network composed of 3D graph convolutions and a transformer is designed to extract spatio-temporal features and global descriptors from the Gaussian scenes for place recognition. Extensive evaluations on three datasets demonstrate that our method can effectively leverage the complementary strengths of both multi-view cameras and LiDAR, achieving state-of-the-art place recognition performance while maintaining solid generalization ability. Our open-source code will be released at https://github.com/QiZS-BIT/GSPR.
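The overall idea of the abstract can be illustrated with a minimal toy sketch (not the paper's implementation): each place is stored as a set of explicit 3D Gaussians whose geometry comes from LiDAR and whose appearance comes from RGB views, a global descriptor is pooled from the Gaussian attributes, and place recognition reduces to nearest-neighbor retrieval over descriptors. The `make_gaussian_scene`, `global_descriptor`, and pooling choices below are hypothetical simplifications; GSPR learns the descriptor with 3D graph convolutions and a transformer instead of hand-crafted pooling.

```python
import numpy as np

def make_gaussian_scene(n, seed):
    """Toy scene: n Gaussians with 3D mean, scale, opacity, and RGB color."""
    rng = np.random.default_rng(seed)
    return {
        "mean": rng.normal(size=(n, 3)),     # position (e.g., from LiDAR)
        "scale": rng.normal(size=(n, 3)),    # anisotropic extent
        "opacity": rng.uniform(size=(n, 1)),
        "color": rng.uniform(size=(n, 3)),   # appearance (from RGB views)
    }

def global_descriptor(scene):
    """Pool per-Gaussian attributes into one vector (mean + std pooling).
    This stands in for the learned 3D graph-conv + transformer network."""
    feats = np.concatenate([scene["mean"], scene["scale"],
                            scene["opacity"], scene["color"]], axis=1)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

def cosine(a, b):
    """Cosine similarity between two descriptors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Place recognition as descriptor retrieval: the query scene (seed 0)
# should match the database entry built from the same scene.
query = make_gaussian_scene(256, seed=0)
database = [make_gaussian_scene(256, seed=s) for s in (0, 1, 2)]
scores = [cosine(global_descriptor(query), global_descriptor(d))
          for d in database]
best = int(np.argmax(scores))
```

Here `best` is 0 because the query and the first database scene are identical; in the real system the descriptors must additionally be robust to viewpoint and temporal changes between revisits, which is what the learned spatio-temporal features provide.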