🤖 AI Summary
This work proposes X-GS, a scalable, open, end-to-end framework that unifies real-time semantic enhancement and multimodal downstream tasks within 3D Gaussian splatting (3DGS). X-GS introduces the X-GS-Perceiver module to jointly optimize geometry and camera poses from pose-free RGB or RGB-D videos, while distilling high-dimensional semantic features from vision foundation models into the 3D Gaussian representation. The X-GS-Thinker component further integrates a vision-language model to enable multimodal reasoning. For the first time, this framework cohesively combines online SLAM, semantic distillation, and multimodal task execution. Leveraging online vector quantization, GPU-accelerated sampling, and a parallelized pipeline, X-GS efficiently supports semantic 3D reconstruction, zero-shot image captioning, and object detection in real-world scenarios, achieving real-time performance, broad generality, and strong scalability.
📝 Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis and has since extended into numerous spatial AI applications. However, most existing 3DGS methods are isolated, each focusing on a specific domain such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, which takes unposed RGB (or optionally RGB-D) video streams as input, co-optimizes geometry and poses, and distills high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.
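The abstract credits real-time semantic distillation in part to an online Vector Quantization module: instead of attaching a full high-dimensional foundation-model feature to every Gaussian, each Gaussian stores a small codebook index, and the codebook itself is updated as frames stream in. The paper does not specify the update rule, so the sketch below is a minimal illustration assuming VQ-VAE-style exponential-moving-average codebook updates; the class name, codebook size, and decay value are illustrative, not from the paper.

```python
import numpy as np

class OnlineVQ:
    """Illustrative online vector quantizer (assumed EMA update rule).

    Maps high-dimensional semantic features to codebook indices so each
    3D Gaussian can store one small integer instead of a full feature
    vector, and refines the codebook incrementally as new frames arrive.
    """

    def __init__(self, codebook_size=256, dim=512, decay=0.99, seed=0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.standard_normal((codebook_size, dim)).astype(np.float32)
        self.decay = decay
        # EMA statistics for each code: usage counts and feature sums
        self.ema_counts = np.ones(codebook_size, dtype=np.float32)
        self.ema_sums = self.codebook.copy()

    def quantize(self, feats):
        """Assign (N, dim) features to nearest codes; update codebook online."""
        # Squared distance from every feature to every code: (N, K)
        d = ((feats[:, None, :] - self.codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = d.argmin(axis=1)
        # EMA update of per-code statistics from this batch
        one_hot = np.eye(len(self.codebook), dtype=np.float32)[idx]
        self.ema_counts = self.decay * self.ema_counts + (1 - self.decay) * one_hot.sum(0)
        self.ema_sums = self.decay * self.ema_sums + (1 - self.decay) * (one_hot.T @ feats)
        self.codebook = self.ema_sums / self.ema_counts[:, None]
        return idx  # one small index per Gaussian; feature ≈ codebook[idx]
```

A feature for any Gaussian is then recovered as `codebook[idx]`, which is what makes storing distilled semantics per-Gaussian cheap enough for an online, real-time setting.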