🤖 AI Summary
To address the limited robustness of visual and LiDAR loop closure detection in GNSS-denied, unstructured environments (e.g., planetary exploration), this paper proposes MPRF, a multimodal place recognition framework. Its two-stage pipeline (1) selects candidate frames by visual retrieval, using DINOv2 features aggregated with SALAD into global descriptors, and (2) verifies each candidate geometrically with SONATA-based LiDAR descriptors while estimating an explicit 6-DoF relative pose. The key idea is to couple transformer-based foundation models for both modalities, mitigating matching failures caused by texture scarcity in images and sparsity in point clouds. Evaluated on the S3LI and S3LI Vulcano datasets, MPRF achieves higher retrieval precision than state-of-the-art methods, improves pose estimation robustness in low-texture regions, and provides interpretable correspondences suitable for SLAM back-ends.
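The first stage (candidate screening) amounts to nearest-neighbour search over global image descriptors. The sketch below is a minimal illustration, not MPRF's released code: it assumes each frame has already been encoded into a single global vector (for example by a SALAD head on DINOv2 features, which is not run here) and ranks past keyframes by cosine similarity. The function names and the `exclude_recent` parameter, which skips temporally adjacent frames, are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Normalize rows to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def retrieve_candidates(query: np.ndarray, database: np.ndarray, k: int = 5,
                        exclude_recent: int = 0) -> tuple[np.ndarray, np.ndarray]:
    """Return indices and cosine scores of the k most similar past keyframes.

    query:          (D,) global descriptor of the current frame.
    database:       (N, D) global descriptors of past keyframes, oldest first.
    exclude_recent: hypothetical knob; the last `exclude_recent` keyframes are
                    masked out so temporal neighbours are not proposed as loops.
    """
    q = l2_normalize(query)
    db = l2_normalize(database)
    scores = db @ q                                # (N,) cosine similarities
    if exclude_recent > 0:
        scores[-exclude_recent:] = -np.inf
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]      # unordered top-k
    top = top[np.argsort(-scores[top])]            # sort by descending score
    return top, scores[top]


# Usage with synthetic descriptors standing in for real image embeddings.
rng = np.random.default_rng(0)
db = rng.normal(size=(100, 64))
query = db[42] + 0.05 * rng.normal(size=64)        # a noisy revisit of keyframe 42
idx, sims = retrieve_candidates(query, db, k=3)
print(idx, sims)
```

Only the top-k candidates are passed on, so the more expensive geometric check runs on a handful of frames rather than on the whole map.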
📝 Abstract
Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as those encountered in planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI and S3LI Vulcano datasets show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences suitable for SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF.
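The second stage, geometric verification with 6-DoF pose estimation, can be sketched as descriptor matching followed by robust rigid registration. The code below is an illustration under stated assumptions rather than the paper's method: it treats per-point LiDAR descriptors (such as those a SONATA backbone might produce) as given arrays, matches them by mutual nearest neighbour, and estimates the pose with RANSAC over the Kabsch/SVD solution. All function names, the inlier threshold, the iteration count, and the `min_inliers` acceptance rule are hypothetical choices.

```python
import numpy as np


def mutual_nn_matches(desc_a: np.ndarray, desc_b: np.ndarray):
    """Index pairs (i, j) where a_i and b_j are each other's nearest neighbour."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T
    a_to_b = sim.argmax(axis=1)
    b_to_a = sim.argmax(axis=0)
    ia = np.arange(len(a))
    keep = b_to_a[a_to_b] == ia
    return ia[keep], a_to_b[keep]


def kabsch(src: np.ndarray, dst: np.ndarray):
    """Least-squares rotation R and translation t with R @ src_i + t ~= dst_i."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs


def ransac_pose(src, dst, iters=200, thresh=0.1, seed=0):
    """RANSAC over minimal 3-point Kabsch fits, refit on the best inlier set."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        sample = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[sample], dst[sample])
        inliers = np.linalg.norm(src @ R.T + t - dst, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        return None, None, best
    R, t = kabsch(src[best], dst[best])
    return R, t, best


def verify_loop(pts_q, desc_q, pts_c, desc_c, min_inliers=20, thresh=0.1):
    """Return (R, t, n_inliers) if the candidate passes verification, else None."""
    iq, ic = mutual_nn_matches(desc_q, desc_c)
    if len(iq) < max(3, min_inliers):
        return None
    R, t, inl = ransac_pose(pts_q[iq], pts_c[ic], thresh=thresh)
    if R is None or inl.sum() < min_inliers:
        return None
    return R, t, int(inl.sum())


# Synthetic check: the candidate scan is the query scan under a known pose,
# shuffled, with 30 points displaced to act as outliers.
rng = np.random.default_rng(1)
pts_q = rng.uniform(0, 10, size=(200, 3))
desc_q = rng.normal(size=(200, 32))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
perm = rng.permutation(200)
pts_c = (pts_q @ R_true.T + t_true)[perm]
desc_c = desc_q[perm] + 0.01 * rng.normal(size=(200, 32))
pts_c[:30] += rng.uniform(-10, 10, size=(30, 3))
result = verify_loop(pts_q, desc_q, pts_c, desc_c)
```

Accepting a loop only when enough correspondences agree on one rigid transform is what turns a retrieval hit into a usable constraint: the surviving inlier pairs and the estimated (R, t) are exactly the kind of interpretable output a SLAM back-end can consume.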