SLGaussian: Fast Language Gaussian Splatting in Sparse Views

📅 2024-12-11
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the inefficiency and multi-view optimization dependency of existing methods for constructing 3D semantic fields from sparse views, this paper introduces the first feed-forward, language-enhanced 3D Gaussian Splatting framework. Methodologically, it integrates SAM-based video tracking to ensure cross-frame segmentation consistency, designs a low-dimensional index embedding for high-dimensional CLIP features, and incorporates sparse-view geometric constraints to achieve spatial alignment and efficient optimization. The method reconstructs an open-vocabulary 3D semantic field in under 30 seconds using only two input views, with per-query latency as low as 0.011 seconds. On the LERF and 3D-OVS benchmarks, it achieves significant improvements in IoU, localization accuracy, and mIoU over state-of-the-art approaches. The authors claim this is the first work enabling real-time, open-vocabulary 3D semantic understanding from sparse inputs.
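The low-dimensional index embedding can be illustrated with a minimal sketch: rather than attaching a full high-dimensional CLIP feature to every Gaussian, each Gaussian stores only a small segment index that selects a row in a per-scene codebook of CLIP features, and open-vocabulary queries are resolved by cosine similarity against the codebook. All names, shapes, and the codebook construction below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: SAM + video tracking yields a small set of segments that are
# consistent across views; each segment gets one CLIP feature in a codebook.
num_segments = 8
clip_dim = 512
codebook = rng.normal(size=(num_segments, clip_dim))  # stand-in for per-segment CLIP features
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

# Each Gaussian carries only a low-dimensional index, not a 512-D feature.
num_gaussians = 1000
segment_ids = rng.integers(0, num_segments, size=num_gaussians)

def query(text_feature: np.ndarray) -> np.ndarray:
    """Return a per-Gaussian relevance score for an open-vocabulary text query."""
    text_feature = text_feature / np.linalg.norm(text_feature)
    seg_scores = codebook @ text_feature      # cosine similarity per segment
    return seg_scores[segment_ids]            # broadcast to Gaussians via the index

scores = query(rng.normal(size=clip_dim))     # stand-in for a CLIP text embedding
```

Because a query touches only the tiny codebook plus one gather over the indices, per-query cost stays nearly constant in the number of Gaussians, which is consistent with the sub-millisecond-scale latencies reported above.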

📝 Abstract
3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.
Problem

Research questions and friction points this paper is trying to address.

Efficient 3D semantic field learning from sparse viewpoints
Accurate 3D scene understanding under limited view conditions
Fast open-vocabulary querying and segmentation in 3D space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward 3D semantic field construction
SAM segmentation with video tracking
Low-dimensional CLIP feature indexing