Tackling View-Dependent Semantics in 3D Language Gaussian Splatting

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address semantic misalignment in language-driven open-vocabulary segmentation for 3D Gaussian Splatting (3D-GS), caused by neglecting view-dependent semantics, this paper proposes LaGa (Language Gaussians), the first framework to formally model and exploit such semantics. LaGa achieves cross-view semantic alignment and aggregation via object-level scene decomposition, clustering of multi-view semantic descriptors, and a view-aware dynamic reweighting mechanism, overcoming the limitations of directly projecting 2D features onto 3D Gaussians. The method integrates 3D-GS rendering, open-vocabulary language embeddings, and geometry-aware semantic modeling. Evaluated on the LERF-OVS benchmark, LaGa achieves an 18.7% mIoU (mean Intersection-over-Union) improvement over prior state-of-the-art methods. The implementation is publicly released.

📝 Abstract
Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints--a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.
Problem

Research questions and friction points this paper is trying to address.

Addressing view-dependent semantics in 3D language understanding
Bridging 2D and 3D semantic gaps in scene reconstruction
Improving multi-view semantic aggregation for 3D objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes the 3D scene into objects to establish cross-view semantic connections
Clusters multi-view semantic descriptors into aggregated representations
Reweights semantic clusters based on multi-view observations
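The summary describes these steps only at a high level. As a rough illustration of the cluster-then-reweight idea, the NumPy sketch below assumes per-view semantic descriptors for a single, already-segmented object, uses cosine k-means as the clustering step, and a softmax over cluster-query similarities as the reweighting. All function names and parameters (`aggregate_view_semantics`, `query_relevance`, `tau`) are hypothetical and are not LaGa's actual implementation.

```python
import numpy as np

def aggregate_view_semantics(descriptors, n_clusters=3, n_iters=20, seed=0):
    """Cluster one object's per-view descriptors (cosine k-means sketch);
    return unit-norm cluster centroids and their view-support weights."""
    rng = np.random.default_rng(seed)
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    centers = d[rng.choice(len(d), n_clusters, replace=False)]
    for _ in range(n_iters):
        assign = np.argmax(d @ centers.T, axis=1)  # nearest centroid by cosine
        for k in range(n_clusters):
            members = d[assign == k]
            if len(members):                       # keep old center if empty
                c = members.mean(axis=0)
                centers[k] = c / np.linalg.norm(c)
    # fraction of views supporting each semantic cluster
    weights = np.bincount(assign, minlength=n_clusters) / len(d)
    return centers, weights

def query_relevance(centers, weights, text_emb, tau=0.1):
    """Score an object against a text query: softmax over cluster-query
    similarities, combined with each cluster's view support."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    sims = centers @ text_emb
    soft = np.exp(sims / tau) / np.exp(sims / tau).sum()
    return float((soft * weights * sims).sum())
```

The softmax temperature `tau` controls how strongly the query-relevant semantic cluster dominates; a small value approximates picking the single best-matching view cluster while still discounting clusters seen from few views.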
👥 Authors
Jiazhong Cen, Shanghai Jiao Tong University (Computer Vision, 3D Scene Understanding)
Xudong Zhou, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Jiemin Fang, Senior Researcher, Huawei (Neural Rendering, 3D Vision, AutoML, Neural Architecture Search, Computer Vision)
Changsong Wen, Shanghai Jiao Tong University (Computer Vision)
Lingxi Xie, Huawei Technologies Co., Ltd.
Xiaopeng Zhang, Huawei Technologies Co., Ltd.
Wei Shen, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Qi Tian, Huawei Technologies Co., Ltd.