LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary 3D scene understanding methods often suffer from slow inference, high memory consumption, and complex pipelines. This work proposes the first training-free, lightweight framework that achieves efficient language-driven segmentation by assigning only 2-byte semantic indices to salient regions within a multi-view reconstructed 3D representation. The approach integrates a compact index-to-feature mapping with a single-step clustering strategy, drastically reducing both computational and storage overhead. It attains state-of-the-art performance on LERF-OVS, ScanNet, and DL3DV-OVS benchmarks while accelerating inference by 50–400× and reducing memory usage by 64× compared to prior methods. Complex indoor and outdoor scenes can thus be understood in under five seconds.

📝 Abstract
Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50–400× speedup and 64× lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page https://vision3d-lab.github.io/lightsplat/.
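To make the memory argument concrete, here is a minimal sketch of the idea the abstract describes: storing a 2-byte semantic index per Gaussian plus one small index-to-feature table, instead of a dense feature vector per Gaussian. All names, shapes, and the query logic below are illustrative assumptions, not LightSplat's actual implementation.

```python
import numpy as np

# Illustrative sketch (assumed sizes, not from the paper's code).
N_GAUSSIANS = 1_000_000   # Gaussians in a reconstructed scene
N_SEGMENTS = 300          # distinct salient regions/masks (fits in uint16)
FEAT_DIM = 512            # e.g. a CLIP-style embedding dimension

rng = np.random.default_rng(0)

# Dense baseline: one float32 feature vector per Gaussian.
dense_bytes = N_GAUSSIANS * FEAT_DIM * 4

# Index-based: a 2-byte index per Gaussian plus one small feature table.
indices = rng.integers(0, N_SEGMENTS, size=N_GAUSSIANS, dtype=np.uint16)
table = rng.standard_normal((N_SEGMENTS, FEAT_DIM)).astype(np.float32)
table /= np.linalg.norm(table, axis=1, keepdims=True)  # unit-normalize rows
index_bytes = indices.nbytes + table.nbytes

print(f"dense:   {dense_bytes / 1e6:.0f} MB")
print(f"indexed: {index_bytes / 1e6:.1f} MB "
      f"({dense_bytes / index_bytes:.0f}x smaller)")

# Open-vocabulary query: score each segment against a normalized text
# embedding, then broadcast the best segment back to its Gaussians.
# Here the "text embedding" is a noisy copy of segment 42, standing in
# for a real text encoder.
text_emb = table[42] + 0.01 * rng.standard_normal(FEAT_DIM).astype(np.float32)
text_emb /= np.linalg.norm(text_emb)
scores = table @ text_emb           # cosine similarity per segment
best_segment = int(np.argmax(scores))
mask = indices == best_segment      # per-Gaussian selection via the 2-byte index
```

The point of the sketch is that the per-Gaussian cost drops from `FEAT_DIM * 4` bytes to 2 bytes, and a language query only touches the small table before a cheap index lookup selects the matching Gaussians.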
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary 3D scene understanding
memory efficiency
inference speed
3D segmentation
semantic representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary 3D understanding
memory-efficient
training-free
semantic indexing
fast inference