🤖 AI Summary
This work addresses two limitations of open-vocabulary 3D scene understanding: inefficient language-geometry alignment, and reliance on rendering and per-scene fine-tuning. We propose a rendering-free, end-to-end method that directly maps natural language to 3D Gaussians. Our approach builds upon 3D Gaussian Splatting and achieves fine-grained spatial alignment via pixel-level ray-Gaussian intersection mapping. Key contributions include: (1) a novel language-feature injection mechanism that binds language-aligned CLIP embeddings directly to 3D Gaussians, bypassing conventional rendering pipelines; and (2) a product quantization (PQ) scheme, trained on general large-scale image data and shared across scenes, that compresses the embeddings without per-scene adaptation. Evaluated on open-vocabulary 3D semantic segmentation, object localization, and interactive selection tasks, our method significantly outperforms prior art while offering high efficiency, strong generalization across unseen scenes, and full end-to-end trainability.
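To make the product quantization idea concrete, here is a minimal sketch of encoding and decoding with codebooks that are already trained. It is not the authors' implementation: the function names `pq_encode` and `pq_decode`, the toy codebooks, and the equal-width sub-vector split are all assumptions made for illustration. The point is only that each embedding is stored as a few small integer codes, and that the same codebooks can be reused for every scene.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Encode `vector` as one codeword index per sub-space.

    `codebooks` is a list of M sub-codebooks (hypothetical, assumed
    pre-trained elsewhere); each is a list of codewords whose length
    equals len(vector) // M.
    """
    num_subspaces = len(codebooks)
    if len(vector) % num_subspaces != 0:
        raise ValueError("vector length must be divisible by the number of codebooks")
    sub_dim = len(vector) // num_subspaces
    codes = []
    for m, codebook in enumerate(codebooks):
        sub_vector = vector[m * sub_dim:(m + 1) * sub_dim]
        # Pick the nearest codeword in this sub-space.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(sub_vector, codebook[k]),
        )
        codes.append(nearest)
    return codes


def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector by concatenating the chosen codewords."""
    reconstructed = []
    for code, codebook in zip(codes, codebooks):
        reconstructed.extend(codebook[code])
    return reconstructed


# Toy example: a 4-D vector split into two 2-D sub-spaces.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
toy_codes = pq_encode([0.9, 0.8, 0.1, 0.9], toy_codebooks)
toy_approx = pq_decode(toy_codes, toy_codebooks)
```

In this sketch, each Gaussian would store only the short code list rather than a full-length embedding. The embedding is recovered on demand from the shared codebooks.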
📝 Abstract
We introduce Dr. Splat, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting. Unlike existing language-embedded 3DGS methods, which rely on a rendering process, our method directly associates language-aligned CLIP embeddings with 3D Gaussians for holistic 3D scene understanding. The key to our method is a language feature registration technique in which CLIP embeddings are assigned to the dominant Gaussians intersected by each pixel-ray. Moreover, we integrate Product Quantization (PQ) trained on general large-scale image data to compactly represent embeddings without per-scene optimization. Experiments demonstrate that our approach significantly outperforms existing approaches on 3D perception benchmarks, such as open-vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection tasks. For video results, please visit: https://drsplat.github.io/
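The registration step described above can be illustrated with a small sketch. The abstract does not say how "dominant" Gaussians are chosen, so this example makes several assumptions: dominance is measured by the standard front-to-back alpha-blending weight, the top `top_k` Gaussians per ray are kept, and per-Gaussian features are aggregated as a weight-scaled sum followed by L2 normalization. The names `blending_weights` and `register_features` are hypothetical and do not come from the paper.

```python
import math


def blending_weights(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def register_features(rays, top_k=1):
    """Assign each pixel's embedding to the top_k dominant Gaussians on its ray.

    `rays` is an iterable of (gaussian_ids, alphas, embedding) tuples, where
    the Gaussians are assumed to be sorted front to back along the ray.
    Returns {gaussian_id: L2-normalized aggregated embedding}.
    """
    sums = {}
    for gaussian_ids, alphas, embedding in rays:
        weights = blending_weights(alphas)
        # Assumption: "dominant" means largest blending weight on this ray.
        ranked = sorted(zip(weights, gaussian_ids), reverse=True)[:top_k]
        for weight, gaussian_id in ranked:
            accumulator = sums.setdefault(gaussian_id, [0.0] * len(embedding))
            for d, value in enumerate(embedding):
                accumulator[d] += weight * value
    registered = {}
    for gaussian_id, accumulator in sums.items():
        norm = math.sqrt(sum(v * v for v in accumulator))
        registered[gaussian_id] = (
            [v / norm for v in accumulator] if norm > 0 else accumulator
        )
    return registered


# Toy example: two pixel rays with 2-D stand-ins for CLIP embeddings.
toy_rays = [
    ([0, 1], [0.9, 0.5], [1.0, 0.0]),  # Gaussian 0 dominates this ray
    ([1, 2], [0.2, 0.8], [0.0, 1.0]),  # Gaussian 2 dominates this ray
]
toy_registered = register_features(toy_rays, top_k=1)
```

No image is rendered to obtain the features in this sketch. They are written straight onto the Gaussians, so a text query could then be compared against per-Gaussian embeddings in 3D.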