🤖 AI Summary
Current 3D semantic understanding methods rely heavily on 2D images or text modalities, leaving a gap: no purely 3D end-to-end learning capability, and no semantic modeling framework or data tailored to 3D Gaussian Splatting (3DGS). To address this, we propose the first pure-3D end-to-end semantic learning paradigm for indoor scenes with arbitrary categories. Our method introduces a semantic understanding framework that operates natively on 3DGS, a self-supervised contrastive strategy for 3D feature learning, and the first large-scale 3DGS-based indoor dataset, SceneSplat-7K (6,868 scenes). Furthermore, we design a cross-dataset 3DGS rendering alignment mechanism, guided by vision-language pretraining, to disentangle 3D semantics. Extensive experiments on SceneSplat-7K demonstrate substantial improvements over diverse baselines, validating the effectiveness, generalizability, and scalability of pure-3D semantic understanding. This work establishes foundational infrastructure for standardized semantic reasoning powered by 3DGS.
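The summary mentions a self-supervised contrastive strategy for 3D feature learning, but does not specify the objective. As a rough illustration only, a generic InfoNCE-style loss over per-Gaussian embeddings might look like the sketch below; the function name, the pairing scheme (two noisy views of the same features), and the temperature are all hypothetical choices, not the paper's actual formulation.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """Generic InfoNCE loss over paired embeddings.

    anchors, positives: (N, D) arrays; row i of `positives` is the
    positive match for row i of `anchors` (e.g. two augmented views of
    the same Gaussian neighborhood). All other rows act as negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # pull row i toward column i

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
noisy = feats + 0.01 * rng.normal(size=feats.shape)
loss_matched = info_nce(feats, noisy)                       # near-duplicate pairs: low loss
loss_random = info_nce(feats, rng.normal(size=feats.shape)) # unrelated pairs: high loss
```

Matched pairs should yield a much lower loss than random pairs, which is the signal that drives representation learning without labels.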
📝 Abstract
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities, either during training or jointly at inference. This highlights a clear absence of a model capable of processing 3D data alone to learn semantics end-to-end, along with the data necessary to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6,868 scenes derived from seven established datasets, including ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking of 3DGS-based reasoning on indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over established baselines.
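To make "operating natively on 3DGS" concrete: each scene is a set of Gaussians, and the geometry/appearance fields below follow the standard 3DGS parameterization (xyz mean, quaternion rotation, per-axis scale, opacity, spherical-harmonic color). The extra `features` channel is a hypothetical per-Gaussian semantic embedding that a native model could consume directly; its dimension and the helper name are illustrative, not taken from the paper.

```python
import numpy as np

def make_scene(num_gaussians, sh_degree=3, sem_dim=16, seed=0):
    """Build a randomly initialized 3DGS-style scene as flat arrays.

    Geometry/appearance fields match the usual 3DGS attribute layout;
    `features` is a hypothetical semantic embedding per Gaussian.
    """
    rng = np.random.default_rng(seed)
    sh_coeffs = (sh_degree + 1) ** 2  # 16 SH coefficients per color channel at degree 3
    return {
        "means":     rng.normal(size=(num_gaussians, 3)),             # xyz centers
        "rotations": rng.normal(size=(num_gaussians, 4)),             # quaternions (unnormalized here)
        "scales":    rng.normal(size=(num_gaussians, 3)),             # log-scale per axis
        "opacities": rng.normal(size=(num_gaussians, 1)),             # pre-sigmoid opacity
        "sh":        rng.normal(size=(num_gaussians, sh_coeffs, 3)),  # view-dependent color
        "features":  rng.normal(size=(num_gaussians, sem_dim)),       # semantic embedding (hypothetical)
    }

scene = make_scene(1000)
# Parameters per Gaussian: 3 + 4 + 3 + 1 + 48 + 16 = 75 with these settings.
per_gaussian = sum(int(np.prod(v.shape[1:])) for v in scene.values())
```

A pure-3D pipeline would take such arrays as its only input, with no rendered 2D views or text required at inference time.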