Multi-modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments

📅 2025-11-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of visual and LiDAR loop closure detection in GNSS-denied, unstructured environments (e.g., planetary exploration), this paper proposes MPRF, a multimodal fusion framework built on transformer-based foundation models. Methodologically, it introduces a two-stage retrieval strategy: (1) efficient candidate screening using DINOv2 features aggregated with SALAD; and (2) geometric verification of those candidates with SONATA-based LiDAR descriptors, yielding an explicit 6-DoF pose estimate. Its key contribution is coupling vision foundation models with LiDAR foundation-model descriptors, mitigating matching failures caused by weak texture and point cloud sparsity. Evaluated on the S3LI and S3LI Vulcano datasets, MPRF outperforms state-of-the-art retrieval methods in precision, improves pose estimation robustness in low-texture regions, and produces interpretable cross-modal correspondences suitable for SLAM back-ends.
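
The two-stage design can be pictured as a cheap visual screen followed by LiDAR-based re-ranking. The sketch below is a minimal NumPy illustration of that control flow, not the paper's implementation: the descriptor extractors are replaced with random vectors, and `two_stage_retrieval`, `cosine_sim`, and all dimensions are hypothetical stand-ins for the DINOv2+SALAD and SONATA embeddings.

```python
import numpy as np

def cosine_sim(q, db):
    # Cosine similarity between one query descriptor and a descriptor bank.
    q = q / np.linalg.norm(q)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    return db @ q

def two_stage_retrieval(vis_query, vis_db, lidar_query, lidar_db, k=5):
    # Stage 1: screen loop candidates with global visual descriptors
    # (stand-in for DINOv2 features aggregated with SALAD).
    vis_scores = cosine_sim(vis_query, vis_db)
    candidates = np.argsort(vis_scores)[::-1][:k]
    # Stage 2: re-rank the survivors with LiDAR descriptors
    # (stand-in for SONATA) before geometric verification.
    lidar_scores = cosine_sim(lidar_query, lidar_db[candidates])
    order = np.argsort(lidar_scores)[::-1]
    return candidates[order], lidar_scores[order]

# Toy usage with random descriptors standing in for real embeddings.
rng = np.random.default_rng(0)
vis_db, lidar_db = rng.normal(size=(100, 256)), rng.normal(size=(100, 64))
ranked, scores = two_stage_retrieval(vis_db[42], vis_db, lidar_db[42], lidar_db)
print(ranked[0])  # frame 42 should rank first for its own query
```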

📝 Abstract
Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences suitable for SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF.
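
The abstract's "explicit 6-DoF pose estimation" ultimately reduces to recovering a rotation and translation from matched 3D points. Below is a minimal sketch of that core step, assuming putative correspondences are already given; it is the standard SVD (Kabsch/Umeyama) solution, and a real pipeline such as MPRF's would additionally wrap it in outlier rejection (e.g., RANSAC).

```python
import numpy as np

def rigid_transform_svd(src, dst):
    # Least-squares 6-DoF pose (R, t) aligning src -> dst point sets,
    # given putative correspondences: dst_i ~= R @ src_i + t.
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t

# Toy check: recover a known rotation and translation.
rng = np.random.default_rng(1)
src = rng.normal(size=(50, 3))
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
t_true = np.array([1.0, -2.0, 0.5])
dst = src @ R_true.T + t_true
R, t = rigid_transform_svd(src, dst)
assert np.allclose(R, R_true, atol=1e-6) and np.allclose(t, t_true, atol=1e-6)
```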
Problem

Research questions and friction points this paper is trying to address.

Achieving robust loop closure detection in severely unstructured GNSS-denied environments
Overcoming visual aliasing and LiDAR sparsity in planetary exploration scenarios
Unifying place recognition and pose estimation with multimodal foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages transformer foundation models for vision and LiDAR
Integrates two-stage visual retrieval with 6-DoF pose estimation
Combines DINOv2 features with SALAD aggregation and SONATA LiDAR descriptors (see the sketch after this list)
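
For concreteness, here is a minimal sketch of pulling DINOv2 patch features via the public torch.hub entry point and collapsing them into one global descriptor. The mean-pooling at the end is a deliberate simplification: SALAD aggregates patch tokens via an optimal-transport assignment to learned clusters, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

# Load the public DINOv2 ViT-S/14 backbone from torch.hub (downloads weights).
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

# Dummy image batch; DINOv2 expects H and W divisible by the 14-pixel patch size.
img = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    feats = model.forward_features(img)
    patch_tokens = feats['x_norm_patchtokens']  # (1, 256, 384) for 224x224 input

# Naive global descriptor: L2-normalized mean of patch tokens. This mean-pool
# is only a stand-in for SALAD's learned optimal-transport aggregation.
desc = F.normalize(patch_tokens.mean(dim=1), dim=-1)
print(desc.shape)  # torch.Size([1, 384])
```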