VGGT-SLAM 2.0: Real-time Dense Feed-forward Scene Reconstruction

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a real-time RGB feed-forward SLAM system that addresses the high-dimensional drift, planar degeneracy, and reconstruction ambiguity (arising from unknown camera intrinsics) present in VGGT-SLAM. By reformulating the factor graph, the system suppresses 15-degree-of-freedom drift and planar degeneracy. Notably, it is the first to show that a specific attention layer within VGGT can be leveraged directly for image-match verification, improving loop-closure accuracy without additional training. Combined with a lightweight design, the system achieves real-time inference on the Jetson Thor platform. Experiments show a roughly 23% reduction in pose error on the TUM dataset and robust performance across diverse environments, including apartments, offices, and a 4,200-square-foot barn, while also supporting open-set object detection.

📝 Abstract
We present VGGT-SLAM 2.0, a real-time RGB feed-forward SLAM system that substantially improves upon VGGT-SLAM for incrementally aligning submaps created from VGGT. First, we remove the high-dimensional 15-degree-of-freedom drift and planar degeneracy of VGGT-SLAM with a new factor-graph design, while still addressing the reconstruction ambiguity of VGGT under unknown camera intrinsics. Second, by studying the attention layers of VGGT, we show that one of the layers is well suited to assist in image-retrieval verification without additional training, which enables both rejecting false-positive matches and completing more loop closures. Finally, we conduct a suite of experiments showing that VGGT-SLAM 2.0 can easily be adapted for open-set object detection, and we demonstrate real-time performance while running online onboard a ground robot using a Jetson Thor. We test in environments ranging from cluttered indoor apartments and office scenes to a 4,200-square-foot barn, and we also show that VGGT-SLAM 2.0 achieves the highest accuracy on the TUM dataset, with about 23 percent less pose error than VGGT-SLAM. Code will be released upon publication.
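The retrieval-verification idea in the abstract can be sketched as a two-stage check over per-image token features: a cheap global-descriptor comparison to propose loop-closure candidates, followed by a mutual nearest-neighbour token match count to reject false positives. This is a minimal illustrative sketch, not the paper's actual procedure: the feature source stands in for a VGGT attention layer, and all function names, thresholds, and the inlier criterion are assumptions.

```python
import numpy as np

def l2norm(x, axis=-1):
    """L2-normalize along the given axis (safe for zero vectors)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def retrieve_and_verify(query_tokens, db_tokens_list, retrieve_thresh=0.7,
                        token_thresh=0.8, min_inliers=8):
    """Return indices of database frames accepted as loop closures.

    query_tokens: (N, D) per-patch features for the query frame
    db_tokens_list: list of (M, D) arrays, one per database frame
    NOTE: the feature source is a stand-in for an attention layer of VGGT;
    the thresholds and inlier count are illustrative, not from the paper.
    """
    q = l2norm(query_tokens)
    q_global = l2norm(q.mean(axis=0))
    accepted = []
    for idx, db_tokens in enumerate(db_tokens_list):
        d = l2norm(db_tokens)
        # Stage 1: cheap global-descriptor retrieval (mean-pooled tokens).
        if float(q_global @ l2norm(d.mean(axis=0))) < retrieve_thresh:
            continue
        # Stage 2: verification via mutual nearest-neighbour token matches.
        sim = q @ d.T                      # (N, M) cosine similarities
        fwd = sim.argmax(axis=1)           # best db token per query token
        bwd = sim.argmax(axis=0)           # best query token per db token
        mutual = bwd[fwd] == np.arange(len(q))
        strong = sim[np.arange(len(q)), fwd] > token_thresh
        if int(np.count_nonzero(mutual & strong)) >= min_inliers:
            accepted.append(idx)
    return accepted
```

In a SLAM loop this check would run before adding a loop-closure factor to the graph, so that spurious matches never perturb the optimized trajectory.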
Problem

Research questions and friction points this paper is trying to address.

drift
degeneracy
reconstruction ambiguity
loop closure
false positive matches
Innovation

Methods, ideas, or system contributions that make the work stand out.

factor graph optimization
attention-based image retrieval
real-time SLAM
drift reduction
open-set object detection