🤖 AI Summary
Existing semantic SLAM systems struggle to balance denseness, efficiency, and scalability. Explicit representations are limited in resolution and cannot predict unobserved regions, while implicit methods rely on time-consuming ray tracing and often fail to run in real time. This work proposes GS3LAM, a framework that models the scene as a Semantic Gaussian Field (SG-Field) and jointly optimizes camera poses, scene geometry, and semantics. It introduces Depth-adaptive Scale Regularization (DSR) to mitigate misalignment between Gaussian scales and surface geometry, and a Random Sampling-based Keyframe Mapping (RSKM) strategy to suppress catastrophic forgetting. Experiments on benchmark datasets show that the method outperforms state-of-the-art approaches in tracking robustness, rendering quality, and semantic accuracy.
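Neither the summary nor the abstract gives the DSR formula. The sketch below is only a hypothetical illustration of what a *depth-adaptive* scale regularizer can look like. It assumes that a pixel at depth `d` covers roughly `d / focal` world units, and it penalizes any Gaussian whose largest axis exceeds `ratio * d / focal`. The function name, the hinge form, and the `ratio` parameter are all assumptions, not the GS3LAM implementation.

```python
from typing import Sequence


def depth_adaptive_scale_penalty(scales: Sequence[Sequence[float]],
                                 depths: Sequence[float],
                                 focal: float, ratio: float = 1.0) -> float:
    """Hypothetical depth-adaptive scale penalty (illustrative only).

    Assumption: a pixel at depth d covers about d / focal world units, so a
    Gaussian whose largest axis exceeds ratio * d / focal is wider than the
    surface detail its pixel can support. Each Gaussian pays the hinge
    max(0, max_axis - ratio * d / focal); the result is the mean over Gaussians.
    """
    if len(scales) != len(depths):
        raise ValueError("scales and depths must have the same length")
    if not scales:
        return 0.0
    total = 0.0
    for axes, depth in zip(scales, depths):
        bound = ratio * depth / focal  # depth-adaptive upper bound on the scale
        total += max(0.0, max(axes) - bound)  # penalize only oversized axes
    return total / len(scales)


if __name__ == "__main__":
    # Two Gaussians observed at 2 m with a 500 px focal length (bound = 0.004 m).
    scales = [[0.01, 0.002, 0.002], [0.001, 0.001, 0.0005]]
    depths = [2.0, 2.0]
    print(depth_adaptive_scale_penalty(scales, depths, focal=500.0))
```

Because the bound grows linearly with depth, distant Gaussians may be larger than nearby ones before they are penalized. This is the intuition behind making the regularization depth-adaptive rather than using a single global scale limit.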
📝 Abstract
Recently, the multimodal fusion of RGB, depth, and semantics has shown great potential in dense Simultaneous Localization and Mapping (SLAM). However, generating consistent semantic maps requires dense, efficient, and scalable scene representations. Existing semantic SLAM systems based on explicit representations are often limited by resolution and cannot predict unknown areas. Conversely, implicit representations typically rely on time-consuming ray tracing and fail to meet real-time requirements. 3D Gaussian Splatting (3DGS) has emerged as a promising representation that combines the efficiency of point-based methods with the continuity of geometric structures. Building on it, we propose GS3LAM, a Gaussian Semantic Splatting SLAM framework that processes multimodal data to render consistent, dense semantic maps in real time. GS3LAM models the scene as a Semantic Gaussian Field (SG-Field) and jointly optimizes camera poses and the field via multimodal error constraints. Furthermore, a Depth-adaptive Scale Regularization (DSR) scheme is introduced to resolve misalignments between scale-invariant Gaussians and geometric surfaces. To mitigate catastrophic forgetting, we propose a Random Sampling-based Keyframe Mapping (RSKM) strategy, which outperforms the common local covisibility-based optimization. Extensive experiments on benchmark datasets show that GS3LAM achieves greater tracking robustness, better rendering quality, and higher semantic precision than state-of-the-art methods. Source code is available at https://github.com/lif314/GS3LAM.
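To make the contrast between RSKM and local covisibility optimization concrete, here is a minimal sketch under stated assumptions. It assumes that RSKM draws one keyframe uniformly at random from the entire keyframe history (plus the current frame) for each mapping iteration. It also uses a "last `window` keyframes" rule as a simplified stand-in for covisibility selection. Both function names are hypothetical. Neither function reproduces the GS3LAM code.

```python
import random
from typing import List, Sequence


def random_keyframe_schedule(keyframe_ids: Sequence[int], current_id: int,
                             num_iters: int, seed: int = 0) -> List[int]:
    """Hypothetical RSKM-style schedule (illustrative, not the GS3LAM code).

    For every mapping iteration, draw one keyframe uniformly at random from
    the entire keyframe history plus the current frame, so regions observed
    long ago keep receiving gradient updates and are less likely to be forgotten.
    """
    rng = random.Random(seed)  # seeded for reproducible schedules
    pool = list(keyframe_ids) + [current_id]
    return [rng.choice(pool) for _ in range(num_iters)]


def covisibility_window(keyframe_ids: Sequence[int], current_id: int,
                        window: int) -> List[int]:
    """Baseline for contrast: only the `window` most recent keyframes
    (a simplified stand-in for local covisibility selection) plus the current frame."""
    recent = list(keyframe_ids)[-window:] if window > 0 else []  # guard: [-0:] would return everything
    return recent + [current_id]


if __name__ == "__main__":
    history = list(range(10))  # keyframes 0..9 already in the map
    schedule = random_keyframe_schedule(history, current_id=10, num_iters=200)
    local = covisibility_window(history, current_id=10, window=3)
    print("keyframes touched by random schedule:", sorted(set(schedule)))
    print("keyframes touched by local window:   ", local)
```

The local window never revisits early keyframes once they drop out of it, so the Gaussians observed only from those views receive no further supervision. The random schedule keeps every keyframe in play at a per-iteration cost of one rendered view.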