🤖 AI Summary
To address semantic misalignment in language-driven open-vocabulary segmentation for 3D Gaussian Splatting (3D-GS)—caused by neglecting view-dependent semantics—this paper proposes LaGa (Language Gaussians), the first framework to formally model and exploit such semantics. LaGa achieves cross-view semantic alignment and aggregation via object-level scene decomposition, clustering of multi-view semantic descriptors, and a view-aware dynamic reweighting mechanism, overcoming the limitations of directly projecting 2D features onto 3D Gaussians. The method tightly integrates 3D-GS rendering, open-vocabulary language embeddings, and geometry-aware semantic modeling. Evaluated on the LERF-OVS benchmark, LaGa achieves an 18.7% improvement in mean Intersection-over-Union (mIoU) over the prior state of the art. The implementation is publicly released.
📝 Abstract
Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints, a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.
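The cluster-then-reweight idea behind the view-aggregated representations can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the per-view descriptors stand in for 2D semantic features (e.g., CLIP-style embeddings) gathered for one decomposed object, a plain k-means replaces whatever clustering the paper uses, and cluster-size weights stand in for the paper's view-aware dynamic reweighting.

```python
import numpy as np

def aggregate_view_semantics(descriptors, k=3, n_iter=20, seed=0):
    """Sketch: cluster an object's per-view semantic descriptors into k
    semantic modes, then weight each mode by how many views support it.
    (Hypothetical simplification of LaGa's view-aware reweighting.)"""
    rng = np.random.default_rng(seed)
    # Unit-normalize so comparisons are cosine similarities.
    X = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each view's descriptor to its most similar cluster center.
        assign = (X @ centers.T).argmax(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                c = X[mask].mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    # Weight each semantic mode by the fraction of views that voted for it.
    weights = np.bincount(assign, minlength=k).astype(float)
    weights /= weights.sum()
    return centers, weights

def query_relevance(centers, weights, text_emb):
    """Relevance of a text query to the object: weighted sum of the
    cosine similarity between the query and each semantic mode."""
    t = text_emb / np.linalg.norm(text_emb)
    return float((weights * (centers @ t)).sum())
```

With this aggregation, an open-vocabulary query is scored against all of an object's semantic modes rather than a single averaged feature, so semantics visible only from some viewpoints still contribute to the match.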