🤖 AI Summary
This work addresses open-vocabulary semantic segmentation in 3D scenes, achieving the first end-to-end open-vocabulary segmentation over the complete 3D volumetric space of both NeRF and 3D Gaussian Splatting (3DGS), overcoming the limitation of prior methods that produce only 2D masks. The proposed method introduces point-level supervision of a language embedding field, a cross-representation semantic transfer mechanism (NeRF → 3DGS), and the first evaluation protocol that jointly assesses geometry and semantics through 3D queries. It integrates 3D point-cloud language embedding learning, CLIP feature distillation, and voxel-level semantic querying. Evaluated on ScanNet and Objaverse, it achieves state-of-the-art 3D semantic segmentation accuracy while enabling real-time rendering (>60 FPS). This work establishes a new paradigm for open-vocabulary 3D understanding.
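The CLIP feature distillation mentioned above is typically framed as matching per-point embeddings predicted by the language field to target CLIP features. A minimal sketch of such a distillation objective, using a cosine-distance loss (the function name and shapes are illustrative, not the paper's actual implementation):

```python
import numpy as np

def cosine_distillation_loss(pred, target, eps=1e-8):
    """Mean cosine distance between predicted per-point language
    embeddings and target CLIP features.

    pred, target: (N, D) arrays; each row is one 3D point's embedding.
    Returns a scalar in [0, 2]; 0 means perfect alignment.
    """
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    tgt_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * tgt_n, axis=-1)))
```

In practice the targets would come from a 2D CLIP-feature extractor lifted to 3D points, and the loss would be minimized with gradient descent over the field's parameters.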
📝 Abstract
Understanding the 3D semantics of a scene is a fundamental problem for applications such as embodied agents. While NeRFs and 3DGS excel at novel-view synthesis, previous methods for understanding their semantics have been limited to incomplete 3D understanding: their segmentation results are rendered as 2D masks that do not represent the entire 3D space. To address this limitation, we redefine the problem to segment the 3D volume and propose the following methods for better 3D understanding. We directly supervise the 3D points to train the language embedding field, unlike previous methods that anchor supervision at 2D pixels. We transfer the learned language field to 3DGS, attaining real-time rendering for the first time without sacrificing training time or accuracy. Lastly, we introduce a 3D querying and evaluation protocol for assessing the reconstructed geometry and semantics together. Code, checkpoints, and annotations are available at the project page.
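The 3D querying described in the abstract can be sketched as follows: embed an open-vocabulary text query, then compare it against the language embedding stored at each 3D point and keep the points above a similarity threshold. This is a minimal sketch under assumed inputs (pre-computed per-point and text embeddings, e.g. from a CLIP-like encoder; the function name and threshold are illustrative):

```python
import numpy as np

def query_volume(point_embeds, text_embed, threshold=0.5):
    """Segment the 3D point set by cosine similarity to a text query.

    point_embeds: (N, D) per-point language embeddings.
    text_embed:   (D,) embedding of the open-vocabulary query.
    Returns an (N,) boolean mask over the 3D points.
    """
    p = point_embeds / np.linalg.norm(point_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    sims = p @ t  # cosine similarity per point
    return sims >= threshold
```

Because the mask lives on 3D points rather than rendered pixels, the result segments the volume itself and remains consistent across viewpoints.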