EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding

📅 2026-03-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the need for real-time, open-vocabulary 3D semantic understanding in embodied agents by proposing an online, feed-forward 3D Gaussian Splatting (3DGS) method that simultaneously performs full-scene geometric reconstruction and open-vocabulary semantic labeling from a continuous image stream, without per-scene optimization. The key innovation lies in achieving, for the first time, open-vocabulary, online, and feed-forward 3D semantic reconstruction: it leverages an online sparse coefficient field and a CLIP global codebook to efficiently map 2D semantics onto 3D Gaussians, while integrating geometry-aware 3D features to enhance the language embeddings. Experiments demonstrate near real-time performance on indoor benchmarks including ScanNet, ScanNet++, and Replica, with strong generalization, high efficiency, and accurate semantic interpretation.

๐Ÿ“ Abstract
Understanding a 3D scene as it is being explored is essential for embodied tasks, where an agent must construct and comprehend the 3D scene in an online and nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS method for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from streaming images. Unlike existing open-vocabulary 3DGS methods, which are typically restricted to offline or per-scene optimization settings, our objectives are two-fold: 1) reconstruct the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner; 2) generalize to novel scenes through a feed-forward design and support nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook, which binds 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometry-aware CLIP features by aggregating the partial point cloud of the 3DGS through a 3D U-Net, supplementing the 2D-oriented language embeddings with 3D geometric priors. Extensive experiments on diverse indoor datasets, including ScanNet, ScanNet++, and Replica, demonstrate both the effectiveness and efficiency of our method. Check out our project page at https://0nandon.github.io/EmbodiedSplat/.
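The memory-saving idea behind the sparse coefficient field and global codebook can be illustrated with a minimal sketch: instead of storing a full CLIP vector per Gaussian, each Gaussian keeps a few coefficients over a shared codebook of CLIP-like prototypes, and features are reconstructed on demand for open-vocabulary querying. All sizes, names, and the random stand-ins for CLIP embeddings below are hypothetical, not from the paper.

```python
import numpy as np

# Hypothetical sizes (illustrative only, not from the paper)
N = 1000   # number of 3D Gaussians in the scene
D = 512    # CLIP embedding dimension
K = 64     # global codebook size
S = 4      # sparsity: nonzero coefficients per Gaussian

rng = np.random.default_rng(0)

# CLIP global codebook: K prototype embeddings, L2-normalized
# (random stand-ins here; in practice these would come from 2D CLIP features)
codebook = rng.standard_normal((K, D))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

# Online sparse coefficient field: each Gaussian stores only S
# (index, weight) pairs instead of a full D-dimensional CLIP vector.
idx = rng.integers(0, K, size=(N, S))   # codebook indices per Gaussian
w = rng.random((N, S))
w /= w.sum(axis=1, keepdims=True)       # convex combination weights

def gaussian_features(idx, w, codebook):
    """Reconstruct per-Gaussian CLIP features as sparse combinations
    of codebook entries, then L2-normalize for cosine similarity."""
    feats = np.einsum('ns,nsd->nd', w, codebook[idx])
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def open_vocab_scores(feats, text_emb):
    """Cosine similarity of each Gaussian to a text query embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    return feats @ t

feats = gaussian_features(idx, w, codebook)
query = rng.standard_normal(D)          # stand-in for a CLIP text embedding
scores = open_vocab_scores(feats, query)

# Memory comparison: full per-Gaussian features vs. sparse field + codebook
full_floats = N * D
sparse_floats = N * S * 2 + K * D       # (index, weight) pairs + codebook
```

Under these toy sizes the sparse representation needs roughly an order of magnitude fewer values than storing a full CLIP vector per Gaussian, which is the kind of saving that makes binding language embeddings to every Gaussian feasible online.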
Problem

Research questions and friction points this paper is trying to address.

online 3D reconstruction
open-vocabulary 3D understanding
embodied scene understanding
real-time semantic 3D
streaming image processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online 3D Reconstruction
Open-Vocabulary Semantic Understanding
3D Gaussian Splatting
CLIP Embedding
Feed-Forward Architecture