🤖 AI Summary
Current 3D semantic understanding methods rely heavily on 2D images or text modalities, leaving a gap: no purely 3D end-to-end learning capability, and no semantic modeling framework or data tailored to 3D Gaussian Splatting (3DGS). To address this, we propose the first pure-3D end-to-end semantic learning paradigm for indoor scenes with arbitrary categories. Our method introduces a semantic understanding framework that operates natively on 3DGS, a self-supervised contrastive strategy for 3D feature learning, and the first large-scale 3DGS-based indoor dataset, SceneSplat-7K (6,868 scenes). Furthermore, we design a cross-dataset 3DGS rendering alignment mechanism, guided by vision-language pretraining, to disentangle 3D semantics. Extensive experiments on SceneSplat-7K demonstrate substantial improvements over diverse baselines, validating the effectiveness, generalizability, and scalability of pure-3D semantic understanding. This work establishes foundational infrastructure for standardized semantic reasoning powered by 3DGS.
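The summary mentions a self-supervised contrastive strategy for 3D feature learning, but does not specify the objective. As a rough illustration only, a generic InfoNCE-style loss over per-Gaussian embeddings might look like the sketch below; the function name, the pairing scheme (two noisy views of the same features), and the temperature are all hypothetical choices, not the paper's actual formulation.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """Generic InfoNCE loss over paired embeddings.

    anchors, positives: (N, D) arrays; row i of `positives` is the
    positive match for row i of `anchors` (e.g. two augmented views of
    the same Gaussian neighborhood). All other rows act as negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # pull row i toward column i

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
noisy = feats + 0.01 * rng.normal(size=feats.shape)
loss_matched = info_nce(feats, noisy)                       # near-duplicate pairs: low loss
loss_random = info_nce(feats, rng.normal(size=feats.shape)) # unrelated pairs: high loss
```

Matched pairs should yield a much lower loss than random pairs, which is the signal that drives representation learning without labels.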
📝 Abstract
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities, either during training or jointly at inference. This highlights a clear absence of a model capable of processing 3D data alone to learn semantics end-to-end, along with the data necessary to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6,868 scenes derived from seven established datasets, including ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking of 3DGS-based reasoning on indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over established baselines.
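To make "operating natively on 3DGS" concrete: each scene is a set of Gaussians, and the geometry/appearance fields below follow the standard 3DGS parameterization (xyz mean, quaternion rotation, per-axis scale, opacity, spherical-harmonic color). The extra `features` channel is a hypothetical per-Gaussian semantic embedding that a native model could consume directly; its dimension and the helper name are illustrative, not taken from the paper.

```python
import numpy as np

def make_scene(num_gaussians, sh_degree=3, sem_dim=16, seed=0):
    """Build a randomly initialized 3DGS-style scene as flat arrays.

    Geometry/appearance fields match the usual 3DGS attribute layout;
    `features` is a hypothetical semantic embedding per Gaussian.
    """
    rng = np.random.default_rng(seed)
    sh_coeffs = (sh_degree + 1) ** 2  # 16 SH coefficients per color channel at degree 3
    return {
        "means":     rng.normal(size=(num_gaussians, 3)),             # xyz centers
        "rotations": rng.normal(size=(num_gaussians, 4)),             # quaternions (unnormalized here)
        "scales":    rng.normal(size=(num_gaussians, 3)),             # log-scale per axis
        "opacities": rng.normal(size=(num_gaussians, 1)),             # pre-sigmoid opacity
        "sh":        rng.normal(size=(num_gaussians, sh_coeffs, 3)),  # view-dependent color
        "features":  rng.normal(size=(num_gaussians, sem_dim)),       # semantic embedding (hypothetical)
    }

scene = make_scene(1000)
# Parameters per Gaussian: 3 + 4 + 3 + 1 + 48 + 16 = 75 with these settings.
per_gaussian = sum(int(np.prod(v.shape[1:])) for v in scene.values())
```

A pure-3D pipeline would take such arrays as its only input, with no rendered 2D views or text required at inference time.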