VGGT-SLAM++

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a visual SLAM framework that targets two weaknesses of existing Transformer-based systems: short-horizon pose drift and the absence of high-frequency local optimization. The front-end combines the Visual Geometry Grounded Transformer (VGGT) with Sim(3) pose estimation. The back-end builds a dense Digital Elevation Map (DEM) for each submap, embeds DEM patches with DINOv2, and uses these embeddings to link submaps in a covisibility graph. Visual place recognition (VPR) retrieves spatial neighbors and triggers high-frequency local bundle adjustment, which suppresses short-term drift and speeds up graph optimization convergence. The authors report state-of-the-art accuracy on standard SLAM benchmarks, with global consistency in large-scale environments maintained through compact DEM tiles and sublinear-time retrieval.
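As a rough illustration of what a Sim(3) solution involves, the sketch below estimates a scale, rotation, and translation between two sets of corresponding 3D points using the classical closed-form Umeyama method. This is an assumption made for illustration only: the listing does not say which Sim(3) solver VGGT-SLAM++ uses, and the function name `umeyama_sim3` is hypothetical.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form least-squares similarity transform (Umeyama, 1991).

    Returns (s, R, t) such that dst is approximately s * R @ src + t
    for corresponding rows of the (N, 3) arrays src and dst.
    Illustrative only; not taken from the paper.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # 3x3 cross-covariance between the centered sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt

    # Scale from the ratio of aligned covariance to source variance.
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src

    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Given matched points from two overlapping submaps, `s, R, t = umeyama_sim3(points_a, points_b)` maps frame A into frame B. A practical pipeline would typically wrap such a solver in outlier rejection (for example RANSAC), which is omitted here.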
📝 Abstract
We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry front-end that fuses the VGGT feed-forward transformer with a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end; together, these enable accurate large-scale mapping with bounded memory. Prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints, which allows short-horizon pose drift. VGGT-SLAM++ instead restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.
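The DEM, patch, and retrieval steps described in the abstract can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the point cloud is already expressed in a canonical frame with z pointing up, uses a max-height rule per grid cell, and stands in for DINOv2 with arbitrary embedding vectors supplied by the caller. The names `rasterize_dem`, `split_patches`, and `top_k_neighbors` are hypothetical. The retrieval shown is a brute-force cosine search, which is linear in the database size; the sublinear retrieval mentioned in the abstract would need an index structure that is not shown here.

```python
import numpy as np


def rasterize_dem(points, cell=0.1):
    """Max-height elevation grid from an (N, 3) cloud with z up (assumed frame)."""
    points = np.asarray(points, dtype=float)
    xy_min = points[:, :2].min(axis=0)
    # Integer (column, row) cell index of every point.
    idx = np.floor((points[:, :2] - xy_min) / cell).astype(int)
    w, h = idx.max(axis=0) + 1
    dem = np.full((h, w), -np.inf)
    # Unbuffered in-place maximum: keeps the highest z that lands in each cell.
    np.maximum.at(dem, (idx[:, 1], idx[:, 0]), points[:, 2])
    dem[np.isinf(dem)] = np.nan  # mark cells that received no points
    return dem


def split_patches(dem, size):
    """Non-overlapping size x size tiles; incomplete border tiles are dropped."""
    h, w = dem.shape
    return [dem[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]


def top_k_neighbors(query, database, k=3):
    """Indices and cosine similarities of the k database rows closest to query."""
    query = np.asarray(query, dtype=float)
    database = np.asarray(database, dtype=float)
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]
```

In a pipeline of this shape, each patch returned by `split_patches` would be passed to an embedding model, and the submaps whose embeddings rank highest in `top_k_neighbors` would become candidates for covisibility edges and local bundle adjustment.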
Problem

Research questions and friction points this paper is trying to address.

visual SLAM
pose drift
large-scale mapping
transformer-based SLAM
local bundle adjustment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual SLAM
Local Bundle Adjustment
Digital Elevation Map
Visual Place Recognition
Transformer-based Odometry
Avilasha Mandal
Indian Institute of Technology Delhi
Rajesh Kumar
Addverb Technologies
Sudarshan Sunil Harithas
Brown University
Chetan Arora
Professor, IIT Delhi
Computer Vision, Machine Learning