VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments

📅 2026-02-23
🤖 AI Summary
This work proposes VGGT-MPR, a novel framework for multimodal place recognition in autonomous driving that addresses the poor generalization and high training costs associated with handcrafted fusion strategies and parameter-heavy backbone networks. VGGT-MPR introduces the Visual Geometry Grounded Transformer as a unified geometric engine to jointly process visual and LiDAR data. It leverages depth-aware supervision to extract geometry-rich visual embeddings, densifies LiDAR point clouds, and incorporates a training-free re-ranking mechanism to enhance retrieval accuracy. By integrating mask-guided keypoint extraction and confidence-aware matching scores, the method significantly improves robustness against environmental variations, viewpoint shifts, and occlusions. Extensive experiments on multiple large-scale autonomous driving benchmarks and self-collected datasets demonstrate state-of-the-art performance, substantially outperforming existing approaches.

📝 Abstract
In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods largely rely on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometry-rich visual embeddings through prior depth-aware and point-map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, our re-ranking mechanism effectively refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.
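The abstract describes a two-stage pipeline: fast global retrieval over fused multimodal descriptors, followed by a training-free re-ranking of the top candidates using confidence-weighted keypoint correspondences. A minimal sketch of that retrieve-then-rerank pattern is below; the function names, the cosine-similarity retrieval, and the sum-of-confidence scoring are illustrative assumptions, not the paper's actual implementation (the paper's matcher would be VGGT's keypoint tracker, abstracted here as a user-supplied `match_fn`).

```python
import numpy as np

def retrieve_top_k(query_desc, db_descs, k=5):
    """Stage 1 (sketch): global retrieval by cosine similarity
    between a query descriptor and database descriptors."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity per database entry
    return np.argsort(-sims)[:k]       # indices of the k most similar entries

def rerank(candidates, match_fn):
    """Stage 2 (sketch): training-free re-ranking. `match_fn(c)` stands in
    for VGGT's cross-view keypoint tracking and returns per-correspondence
    confidences for candidate c; candidates are ordered by total confidence."""
    scores = [sum(match_fn(c)) for c in candidates]
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [c for _, c in ranked]
```

The point of the split is that stage 1 stays cheap (one dot product per database entry), while the more expensive correspondence scoring in stage 2 only touches the shortlist.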
Problem

Research questions and friction points this paper is trying to address.

multimodal place recognition · autonomous driving · LiDAR · camera · robust localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

VGGT · multimodal place recognition · geometry-aware fusion · training-free re-ranking · LiDAR-camera fusion