🤖 AI Summary
In uncalibrated monocular dense RGB SLAM, projective ambiguity causes submap alignment failure, degrading geometric consistency and map completeness—particularly in long sequences where VGGT-based methods are constrained by GPU memory.
Method: We propose the first dense monocular SLAM framework formulated on the SL(4) manifold, jointly optimizing globally consistent submap alignment and loop closure constraints within the 15-degree-of-freedom projective transformation space. Unlike conventional approaches that align submaps with similarity transformations (SE(3) × ℝ⁺, i.e., rotation, translation, and scale), our approach explicitly models the projective ambiguity induced by unknown camera intrinsics. The system integrates VGGT-based feed-forward scene reconstruction, incremental submap building, projective-geometric constraints, and loop closure correction, without requiring prior knowledge of camera parameters.
Contribution/Results: Our method significantly improves dense map completeness and geometric consistency over long sequences, overcoming VGGT’s practical limitations in processing extended video streams under memory constraints.
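For readers who want the degree-of-freedom count spelled out, the standard definitions are sketched below. The symbols H, s, R, t, and X̃ are notation introduced here for illustration and do not come from the paper.

```latex
% SL(4): 4x4 real matrices with unit determinant.
\mathrm{SL}(4) = \{\, H \in \mathbb{R}^{4 \times 4} \;:\; \det H = 1 \,\},
\qquad \dim \mathrm{SL}(4) = 16 - 1 = 15 .

% A similarity transform is the special case with 3 (rotation) + 3 (translation) + 1 (scale) = 7 DoF.
S = \begin{bmatrix} s R & t \\ 0^{\top} & 1 \end{bmatrix},
\qquad R \in \mathrm{SO}(3),\; t \in \mathbb{R}^{3},\; s > 0 .

% Homogeneous 3D points map up to scale under a projective transform.
\tilde{X}' \sim H \tilde{X}, \qquad \tilde{X} = (x, y, z, 1)^{\top} .
```

The remaining 8 degrees of freedom (15 minus 7) are the affine and projective distortions that a similarity transform cannot represent.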
📝 Abstract
We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity: given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degree-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degree-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. Extensive experiments show that VGGT-SLAM achieves improved map quality on long video sequences that are infeasible for VGGT alone due to its high GPU memory requirements.
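To make the idea of a 15-degree-of-freedom homography between submaps concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`to_sl4`, `apply_homography`, `estimate_homography`) are hypothetical, and the estimator is a generic, unnormalized direct linear transform (DLT) that assumes noise-free 3D point correspondences between two submaps are already available.

```python
import numpy as np

def to_sl4(H):
    """Rescale a 4x4 homography so that det(H) = 1 (an element of SL(4))."""
    det = np.linalg.det(H)
    if det <= 0:
        # For 4x4 matrices det(-H) = det(H), so no real rescaling fixes this.
        raise ValueError("determinant must be positive to normalize into SL(4)")
    return H / det ** 0.25

def apply_homography(H, pts):
    """Apply a 4x4 homography to (N, 3) points and dehomogenize."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :3] / out[:, 3:4]

def estimate_homography(src, dst):
    """Generic DLT (illustrative assumption): find H with dst ~ H @ src.

    Each correspondence gives 3 linear equations in the 16 entries of H,
    so at least 5 points in general position are needed.
    """
    n = len(src)
    src_h = np.hstack([src, np.ones((n, 1))])
    A = np.zeros((3 * n, 16))
    for j in range(3):
        # Row j of H dotted with X, minus y_j times row 4 of H dotted with X, is 0.
        A[j::3, 4 * j:4 * j + 4] = src_h
        A[j::3, 12:16] = -dst[:, j:j + 1] * src_h
    # The null vector of A is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(4, 4)

# Synthetic check: distort random points by a known homography and recover it.
rng = np.random.default_rng(0)
H_true = to_sl4(np.eye(4) + 0.05 * rng.standard_normal((4, 4)))
src = rng.uniform(-1.0, 1.0, size=(20, 3))
dst = apply_homography(H_true, src)
H_est = to_sl4(estimate_homography(src, dst))
```

Note that `H_est` may equal either `H_true` or `-H_true`: both have unit determinant and describe the same projective map. A practical system would also need to handle noise, outliers, and point normalization, which this sketch omits.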