AI Summary
To address the challenge of simultaneously achieving high geometric accuracy and open-world semantic understanding in monocular SLAM, this paper introduces OpenMonoGS-SLAM, the first monocular Gaussian Splatting SLAM framework that requires neither depth sensors nor semantic annotations. Methodologically, it pioneers the integration of vision foundation models (MASt3R, SAM, CLIP) into Gaussian Splatting SLAM and designs a memory mechanism tailored to high-dimensional semantic features, enabling self-supervised alignment between 3D Gaussian representations and open-vocabulary semantics. Key contributions include: (1) the first unified map supporting both centimeter-level geometric precision and open-set semantic expressivity; (2) competitive or superior performance on both closed-set and open-set semantic segmentation benchmarks; and (3) complete independence from depth inputs and 3D semantic ground truth, significantly enhancing robustness and scalability in open environments.
Abstract
Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
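To make the memory mechanism for high-dimensional semantic features more concrete, here is a minimal sketch under assumed design choices (not the paper's actual implementation): CLIP-style feature vectors are kept in a compact memory bank with near-duplicates filtered by cosine similarity, each Gaussian stores only an index into the bank, and an open-vocabulary text query is resolved by comparing its embedding against the bank. The class name, the deduplication threshold, and the capacity limit below are all hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two stacks of vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

class SemanticFeatureMemory:
    """Hypothetical memory bank for high-dimensional semantic features.

    Instead of attaching a full feature vector (e.g. 512-d CLIP) to every
    Gaussian, distinct features are stored once here and Gaussians keep
    only an integer index into the bank.
    """

    def __init__(self, dim, capacity):
        self.dim = dim
        self.capacity = capacity
        self.bank = np.empty((0, dim))

    def insert(self, feats, sim_thresh=0.95):
        # Add only features that are not already well represented in the bank.
        for f in feats:
            novel = (len(self.bank) == 0
                     or cosine_sim(f[None], self.bank).max() < sim_thresh)
            if novel and len(self.bank) < self.capacity:
                self.bank = np.vstack([self.bank, f])
        return self.bank

    def assign(self, feats):
        # Map each incoming feature to the index of its nearest bank entry.
        return cosine_sim(feats, self.bank).argmax(axis=1)
```

An open-vocabulary query then reduces to `cosine_sim(text_embedding[None], mem.bank)` and ranking the bank entries (and hence the Gaussians pointing at them) by similarity; the bank keeps the per-Gaussian storage small regardless of the feature dimension.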