🤖 AI Summary
Reconstructing large-scale 3D geometry from pose-free RGB video streams—online, in real time, and with consistent scale—remains challenging. This paper introduces the first online RGB SLAM system based on 3D Gaussians: it eliminates test-time optimization and depth sensors, employing a feed-forward recurrent network to directly regress camera poses from optical flow. We pioneer the fusion of pseudo-depth estimation with 3D Gaussian mapping, augmented by a local graph rendering mechanism for robust tracking and dense reconstruction. Evaluated on Replica and TUM-RGBD, our method matches SplaTAM’s geometric accuracy while reducing tracking time by more than 90%, and demonstrates practicality through real-world deployment. Key contributions are: (1) the first joint SLAM framework integrating 3D Gaussians and pseudo-depth; (2) feed-forward pose prediction that removes iterative optimization; and (3) a lightweight, end-to-end trainable online reconstruction paradigm.
📝 Abstract
Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions about the input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in the predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively mitigate this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module that directly infers camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90%.
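The core tracking idea—replacing per-frame test-time optimization with a single recurrent network pass over optical-flow features—can be illustrated with a toy sketch. Everything here is an illustrative assumption, not the paper's architecture: a hand-rolled GRU cell with random placeholder weights stands in for the trained recurrent module, flat random vectors stand in for optical-flow features, and a linear head maps the hidden state to a 6-DoF pose increment (3 translation + 3 axis-angle rotation):

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_cell(x, h, W, U, b):
    """One GRU step. W, U, b each stack the update (z), reset (r),
    and candidate (n) parameters along their first axis."""
    Wz, Wr, Wn = W
    Uz, Ur, Un = U
    bz, br, bn = b
    z = 1.0 / (1.0 + np.exp(-(Wz @ x + Uz @ h + bz)))  # update gate
    r = 1.0 / (1.0 + np.exp(-(Wr @ x + Ur @ h + br)))  # reset gate
    n = np.tanh(Wn @ x + Un @ (r * h) + bn)            # candidate state
    return (1.0 - z) * h + z * n

def predict_pose_deltas(flow_feats, hidden_dim=32):
    """Feed-forward recurrent pose regression sketch: each per-frame
    optical-flow feature vector updates the hidden state, and a linear
    head emits a 6-DoF pose increment per frame. Weights are random
    placeholders for a trained network; no iterative optimization runs
    at inference time."""
    in_dim = flow_feats.shape[1]
    W = rng.standard_normal((3, hidden_dim, in_dim)) * 0.1
    U = rng.standard_normal((3, hidden_dim, hidden_dim)) * 0.1
    b = np.zeros((3, hidden_dim))
    head = rng.standard_normal((6, hidden_dim)) * 0.1
    h = np.zeros(hidden_dim)
    deltas = []
    for x in flow_feats:           # online: one forward pass per frame
        h = gru_cell(x, h, W, U, b)
        deltas.append(head @ h)
    return np.stack(deltas)

# toy "optical flow" features for a 5-frame stream
feats = rng.standard_normal((5, 16))
poses = predict_pose_deltas(feats)
print(poses.shape)  # (5, 6): one 6-DoF increment per frame
```

The point of the sketch is the cost model: tracking a frame is one recurrent step plus a matrix-vector product, which is why a feed-forward predictor can cut tracking time so sharply compared with per-frame gradient-based pose optimization.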