Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Semantic misalignment and excessive computational overhead hinder joint language-3D modeling in large-scale scenes. To address these challenges, this paper proposes Lang3D-XL, a framework that embeds language features directly into 3D Gaussians, coupling geometry and semantics end to end. It introduces an ultra-low-dimensional semantic bottleneck feature alongside an Attenuated Downsampler module to mitigate cross-scale semantic misalignment. It further combines multi-resolution hash encoding with polynomial regularization to improve training efficiency and generalization. Evaluated on the in-the-wild HolyScenes dataset, the method outperforms state-of-the-art approaches across semantic querying, editing, and multimodal reasoning tasks. It reduces GPU memory consumption by 42% and accelerates inference by 3.1×, offering a scalable, efficient approach to large-scale open-world 3D language understanding.

📝 Abstract
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for more intuitive human-computer interaction, enabling querying or editing of scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based hash encoder. This significantly improves efficiency in both runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
Problem

Research questions and friction points this paper is trying to address.

Embedding language in 3D scenes for semantic understanding and interaction
Addressing inefficiency and misalignment in large-scale 3D language feature learning
Improving 3D scene querying, editing, and reasoning with natural language
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-dimensional semantic bottleneck features in 3D Gaussians
Multi-resolution hash encoder applied to the rendered features, reducing runtime and GPU memory use
Attenuated Downsampler and regularizations to correct feature misalignment
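
To make the first two points concrete, the following is a minimal NumPy sketch of a multi-resolution hash encoder keyed on low-dimensional feature vectors, in the style popularized by Instant-NGP. It is an illustration under stated assumptions, not the authors' implementation: the class name `MultiResHashEncoder`, the 3-dimensional bottleneck, the number of levels, the table size, and the random (untrained) tables are all hypothetical choices. It assumes one plausible reading of the abstract, namely that each rendered bottleneck feature is used as the lookup coordinate.

```python
import itertools
import numpy as np

# Per-dimension multipliers for an XOR-based spatial hash (Instant-NGP style).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)


def hash_coords(coords, table_size):
    """Hash non-negative integer grid coordinates (N, D) into [0, table_size)."""
    h = np.zeros(coords.shape[0], dtype=np.uint64)
    for d in range(coords.shape[1]):
        h ^= coords[:, d].astype(np.uint64) * PRIMES[d]
    return (h % np.uint64(table_size)).astype(np.int64)


class MultiResHashEncoder:
    """Hypothetical encoder: low-dim feature in [0, 1]^D -> concatenated level features."""

    def __init__(self, in_dim=3, n_levels=4, base_res=4, growth=2.0,
                 table_size=2 ** 12, feat_dim=2, seed=0):
        assert in_dim <= len(PRIMES), "this sketch supports at most 3 input dims"
        rng = np.random.default_rng(seed)
        self.table_size = table_size
        # Grid resolution grows geometrically from coarse to fine.
        self.resolutions = [int(base_res * growth ** l) for l in range(n_levels)]
        # One table per level; these would be learned, here they are random.
        self.tables = [rng.uniform(-1e-2, 1e-2, size=(table_size, feat_dim))
                       for _ in range(n_levels)]
        # Offsets of the 2**in_dim corners of a unit grid cell.
        self.corners = np.array(list(itertools.product([0, 1], repeat=in_dim)),
                                dtype=np.int64)

    def __call__(self, x):
        """x: (N, in_dim) array; returns (N, n_levels * feat_dim)."""
        x = np.clip(x, 0.0, 1.0)
        outputs = []
        for res, table in zip(self.resolutions, self.tables):
            scaled = x * (res - 1)
            lower = np.floor(scaled).astype(np.int64)
            frac = scaled - lower
            level = np.zeros((x.shape[0], table.shape[1]))
            for corner in self.corners:
                idx = hash_coords(lower + corner, self.table_size)
                # Multilinear interpolation weight of this corner.
                weight = np.prod(np.where(corner == 1, frac, 1.0 - frac), axis=1)
                level += weight[:, None] * table[idx]
            outputs.append(level)
        return np.concatenate(outputs, axis=1)


# Usage: a stand-in for a rendered 2x3 map of 3-dimensional bottleneck features.
rendered = np.random.default_rng(1).uniform(size=(2, 3, 3))
encoder = MultiResHashEncoder()
encoded = encoder(rendered.reshape(-1, 3))  # shape (6, 8)
```

In a full pipeline, a small decoder would presumably map the encoded vectors to the dimensionality of the 2D language features used for supervision; that step is omitted here. The efficiency argument is that each Gaussian stores and rasterizes only a few channels, while the expressive capacity lives in the shared hash tables.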