🤖 AI Summary
Problem: Existing end-to-end DUSt3R-style methods estimate local point clouds solely from image pairs, lacking spatial memory and global consistency modeling, thus failing to support incremental, globally consistent metric reconstruction.
Method: We propose the first end-to-end dense SLAM framework based on gated recurrent states: a latent state serves as spatial memory, while a Transformer-driven gated update module enables sequential state evolution; combined with subgraph partitioning, local relative geometric constraint modeling, and global registration optimization, the framework ensures cross-frame geometric consistency. The method operates without scene priors or camera calibration, directly producing globally consistent, metrically accurate dense point clouds from RGB sequences in real time.
Contribution/Results: Our approach achieves significantly higher reconstruction accuracy than state-of-the-art methods across multiple standard benchmarks, while maintaining real-time performance.
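To make the gated recurrent state idea concrete, here is a minimal NumPy sketch of a GRU-style reset/update step applied to a latent memory. It is an illustration under stated assumptions, not the paper's module: the function name `gated_update`, the weight matrices `W_r`, `W_z`, `W_h`, and the use of plain linear maps in place of the Transformer blocks are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(state, frame_feat, W_r, W_z, W_h):
    """One GRU-style update of a latent spatial memory (illustrative only).

    state, frame_feat: arrays of shape (n_tokens, d)
    W_r, W_z, W_h:     arrays of shape (2 * d, d); hypothetical weights that
                       stand in for the Transformer layers of the real model.
    """
    joint = np.concatenate([state, frame_feat], axis=-1)
    r = sigmoid(joint @ W_r)  # reset gate: how much old memory feeds the candidate
    z = sigmoid(joint @ W_z)  # update gate: how much of the memory is overwritten
    gated = np.concatenate([r * state, frame_feat], axis=-1)
    candidate = np.tanh(gated @ W_h)
    # Convex blend of the old memory and the new candidate, per element.
    return (1.0 - z) * state + z * candidate

# Usage: feed frames one at a time and carry the state forward.
rng = np.random.default_rng(0)
d, n_tokens = 8, 4
W_r, W_z, W_h = (rng.normal(scale=0.1, size=(2 * d, d)) for _ in range(3))
state = rng.uniform(-1.0, 1.0, size=(n_tokens, d))
for _ in range(3):
    frame_feat = rng.normal(size=(n_tokens, d))
    state = gated_update(state, frame_feat, W_r, W_z, W_h)
```

Because the output is a per-element blend of the previous state and a bounded candidate, the memory can retain old information where the update gate is near zero and replace it where the gate is near one.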
📝 Abstract
DUSt3R-based end-to-end scene reconstruction has recently shown promising results in dense visual SLAM. However, most existing methods estimate pointmaps from image pairs only, overlooking spatial memory and global consistency. To this end, we introduce GRS-SLAM3R, an end-to-end SLAM framework for dense scene reconstruction and pose estimation from RGB images without any prior knowledge of the scene or camera parameters. Unlike existing DUSt3R-based frameworks, which operate on all image pairs and predict per-pair pointmaps in local coordinate frames, our method accepts sequential input and incrementally estimates metric-scale point clouds in a global coordinate frame. To keep spatial correlations consistent across frames, we use a latent state as spatial memory and design a transformer-based gated update module that resets and updates this memory, continuously aggregating and tracking relevant 3D information across frames. Furthermore, we partition the scene into submaps, apply local alignment within each submap, and register all submaps into a common world frame using relative constraints, producing a globally consistent map. Experiments on various datasets show that our framework achieves superior reconstruction accuracy while maintaining real-time performance.
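The submap registration step can be illustrated with a small sketch that places submaps in a common world frame by chaining relative rigid transforms. This is a simplification under stated assumptions: the helper names (`se3`, `chain_to_world`, `to_world`) are hypothetical, submap 0 is taken as the world frame, and the relative transforms are simply composed, whereas the abstract does not specify how the relative constraints are actually solved.

```python
import numpy as np

def se3(rotation, translation):
    """Build a 4x4 rigid transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def chain_to_world(relative_transforms):
    """Compose relative transforms into world-from-submap transforms.

    relative_transforms[i] maps coordinates of submap i+1 into submap i.
    Submap 0 is assumed to define the world frame.
    """
    world_from_submap = [np.eye(4)]
    for T_rel in relative_transforms:
        world_from_submap.append(world_from_submap[-1] @ T_rel)
    return world_from_submap

def to_world(points, T):
    """Apply a 4x4 rigid transform to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]

# Usage: a 90-degree turn about z, then a step forward along the local x-axis.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
poses = chain_to_world([se3(Rz90, [1.0, 0.0, 0.0]),
                        se3(np.eye(3), [1.0, 0.0, 0.0])])
origin_of_submap_2 = to_world(np.zeros((1, 3)), poses[2])
```

Pure chaining accumulates drift over long sequences, which is why a full system would typically refine the submap poses jointly rather than rely on composition alone.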