🤖 AI Summary
This work addresses the challenge of real-time, queryable open-vocabulary semantic 3D reconstruction. Methodologically, it introduces the first feed-forward semantic Gaussian splatting framework, pioneering the integration of open-set semantic segmentation into the 3D Gaussian splatting pipeline. Multi-view vision-language features extracted by 2D foundation models are distilled into a compact semantic memory bank, enabling geometry, appearance, and an open-vocabulary semantic index for each Gaussian ellipsoid to be predicted jointly in a single forward pass, without scene-level optimization. Compared with existing approaches, the method achieves state-of-the-art geometric fidelity while delivering robust pixel-level open-vocabulary semantic labeling. This significantly enhances semantic queryability and generalization to unseen categories in 3D scenes, establishing an efficient and scalable semantic 3D foundation for applications such as robotic interaction and augmented reality.
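The memory-bank mechanism described above can be sketched in a minimal form. The snippet below is an illustrative assumption, not the paper's implementation: it builds a compact bank by greedily deduplicating L2-normalized vision-language features by cosine similarity, assigns each feature (standing in for a Gaussian's semantic attribute) a discrete index into the bank, and answers an open-vocabulary query by matching a text embedding against the bank. The function names and the similarity threshold are hypothetical.

```python
import numpy as np

def build_memory_bank(features, sim_thresh=0.9):
    """Greedily construct a compact semantic memory bank.

    features: (N, D) array of L2-normalized per-view semantic features.
    Returns (bank, indices): bank is (K, D) with K << N; indices maps
    each input feature to its discrete bank entry.
    """
    bank = []
    indices = np.empty(len(features), dtype=np.int64)
    for i, f in enumerate(features):
        if bank:
            sims = np.stack(bank) @ f  # cosine similarities (unit vectors)
            j = int(np.argmax(sims))
            if sims[j] >= sim_thresh:  # close enough: reuse existing entry
                indices[i] = j
                continue
        bank.append(f)                 # novel semantics: grow the bank
        indices[i] = len(bank) - 1
    return np.stack(bank), indices

def query_bank(bank, text_embedding):
    """Return the bank index best matching an open-vocabulary text query
    (text_embedding assumed L2-normalized, e.g. from a CLIP-style encoder)."""
    return int(np.argmax(bank @ text_embedding))
```

Because each Gaussian stores only a small integer index rather than a full feature vector, the per-scene semantic payload stays compact while remaining queryable against arbitrary text embeddings.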
📝 Abstract
We introduce SegSplat, a novel framework designed to bridge the gap between rapid, feed-forward 3D reconstruction and rich, open-vocabulary semantic understanding. By constructing a compact semantic memory bank from multi-view 2D foundation model features and predicting discrete semantic indices alongside geometric and appearance attributes for each 3D Gaussian in a single pass, SegSplat efficiently imbues scenes with queryable semantics. Our experiments demonstrate that SegSplat achieves geometric fidelity comparable to state-of-the-art feed-forward 3D Gaussian Splatting methods while simultaneously enabling robust open-set semantic segmentation, crucially *without* requiring any per-scene optimization for semantic feature integration. This work represents a significant step towards practical, on-the-fly generation of semantically aware 3D environments, vital for advancing robotic interaction, augmented reality, and other intelligent systems.