AI Summary
Existing visual SLAM methods based on 3D Gaussian Splatting (3DGS) degrade significantly in dynamic environments because they rely on a static-scene assumption. To address this limitation, this work proposes GGD-SLAM, a framework that achieves robust camera localization and dense mapping without predefined semantic annotations or depth input. The method uses a First-In-First-Out (FIFO) queue and a sequential attention mechanism to extract dynamic features, employs a dynamic feature enhancer to separate static and dynamic scene components, and introduces an occlusion-aware inpainting strategy alongside a distractor-adaptive SSIM loss. Evaluated on real-world dynamic datasets, GGD-SLAM demonstrates state-of-the-art performance in both camera pose estimation and dense reconstruction.
Abstract
Visual SLAM algorithms have improved significantly by adopting 3D Gaussian Splatting (3DGS) representations, particularly in generating high-fidelity dense maps. However, they depend on a static-environment assumption and suffer significant performance degradation in dynamic environments. This paper presents GGD-SLAM, a framework that employs a generalizable motion model to address the challenges of localization and dense mapping in dynamic environments, without predefined semantic annotations or depth input. Specifically, the proposed system uses a First-In-First-Out (FIFO) queue to manage incoming frames, enabling dynamic semantic feature extraction through a sequential attention mechanism. This is combined with a dynamic feature enhancer to separate static and dynamic components. Additionally, to minimize the impact of dynamic distractors on the static components, we devise a method that fills occluded areas via static information sampling and design a distractor-adaptive Structural Similarity Index Measure (SSIM) loss tailored for dynamic environments, significantly enhancing the system's resilience. Experiments on real-world dynamic datasets demonstrate that the proposed system achieves state-of-the-art performance in camera pose estimation and dense reconstruction in dynamic scenes.
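The abstract does not give the form of the distractor-adaptive SSIM loss, so the following is only a minimal sketch of the general idea, under assumptions: a per-pixel SSIM dissimilarity is down-weighted wherever a pixel is believed to belong to a dynamic distractor. The function names (`ssim_map`, `masked_ssim_loss`), the uniform 7x7 window, and the `static_weight` map (1 for static pixels, 0 for dynamic ones) are hypothetical choices for illustration, not the paper's formulation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ssim_map(x, y, win=7, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-window SSIM for 2D float images in [0, 1], using a uniform window.

    Returns an array of shape (H - win + 1, W - win + 1), one value per
    fully contained window.
    """
    xw = sliding_window_view(x, (win, win))
    yw = sliding_window_view(y, (win, win))
    mu_x = xw.mean(axis=(-2, -1))
    mu_y = yw.mean(axis=(-2, -1))
    var_x = xw.var(axis=(-2, -1))
    var_y = yw.var(axis=(-2, -1))
    cov = (xw * yw).mean(axis=(-2, -1)) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def masked_ssim_loss(rendered, observed, static_weight, win=7):
    """Weighted mean of (1 - SSIM); assumes an odd window size >= 3.

    static_weight is an assumed per-pixel map: 1 = static, 0 = dynamic.
    """
    s = ssim_map(rendered, observed, win)
    r = win // 2
    # Crop the weights to the window centres so shapes match the SSIM map.
    w = static_weight[r:-r, r:-r]
    return float(((1.0 - s) * w).sum() / (w.sum() + 1e-8))
```

With a weight map that is zero over a moving object (plus a margin of half the window size), the mismatch between the rendered static scene and the observed frame in that region no longer contributes to the loss, which is the intuition behind making a photometric loss robust to distractors.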