AI Summary
Existing feedforward 3D scene understanding methods suffer from coarse semantic representations, low-fidelity geometric reconstruction, significant semantic noise, and reliance on dense view inputs, leading to high deployment costs. This paper proposes an end-to-end, sparse-view-driven framework for holistic 3D scene understanding, unifying geometric, appearance, and language-level semantic modeling. Key contributions include: (1) a semantic-aware anisotropic Gaussian field representation; (2) multi-source semantic feature fusion via a cross-view cost volume; and (3) a two-stage implicit semantic field distillation mechanism enabling open-vocabulary, promptable 3D segmentation. Experiments demonstrate substantial improvements over baselines such as LSM on both promptable and open-vocabulary 3D segmentation benchmarks. Our method achieves finer-grained geometry reconstruction, markedly reduces semantic noise, and supports real-time augmented reality interaction.
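The "semantic-aware anisotropic Gaussian field" in contribution (1) augments each standard 3D Gaussian primitive (mean, anisotropic covariance, opacity, color) with a latent semantic vector. A minimal sketch of that representation is below; all field names and dimensions are illustrative assumptions, not taken from the paper, and a full model would use spherical-harmonic color rather than plain RGB.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SemanticGaussian:
    """One anisotropic 3D Gaussian augmented with a latent semantic vector.

    Illustrative sketch: names and dimensions are assumptions, not the
    paper's actual parameterization.
    """
    mean: np.ndarray       # (3,) center in world space
    scale: np.ndarray      # (3,) per-axis standard deviations (anisotropy)
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z) orienting the ellipsoid
    opacity: float         # alpha used during splatting
    color: np.ndarray      # (3,) RGB (a full model would use SH coefficients)
    semantics: np.ndarray  # (D,) latent feature distilled from 2D encoders

    def covariance(self) -> np.ndarray:
        """Sigma = R diag(scale^2) R^T, the usual 3DGS covariance factorization."""
        w, x, y, z = self.rotation / np.linalg.norm(self.rotation)
        R = np.array([
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
            [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
        ])
        return R @ np.diag(self.scale ** 2) @ R.T


# With an identity rotation, the covariance is just diag(scale^2).
g = SemanticGaussian(
    mean=np.zeros(3),
    scale=np.array([0.1, 0.2, 0.05]),
    rotation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity quaternion
    opacity=0.9,
    color=np.array([0.5, 0.5, 0.5]),
    semantics=np.zeros(512),
)
cov = g.covariance()
```

Rendering then splats `semantics` alongside color, so a 2D query (a text embedding or a click prompt) can be matched against the rasterized feature map.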
Abstract
Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at https://semanticsplat.github.io.
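The cost volume the abstract mentions stores cross-view feature similarities: source-view features are warped to the reference view at a set of depth hypotheses, and each hypothesis is scored by how well the warped features correlate with the reference features. A minimal correlation-style sketch, with the homography warping step omitted and all names assumed for illustration:

```python
import numpy as np


def cost_volume(ref_feat: np.ndarray, warped_src: np.ndarray) -> np.ndarray:
    """Correlation-style cost volume (illustrative sketch, not the paper's exact design).

    ref_feat:   (C, H, W)    reference-view feature map
    warped_src: (D, C, H, W) source-view features already warped to the
                reference view at D depth hypotheses (warping omitted here)
    returns:    (D, H, W)    per-pixel, per-depth feature similarity
    """
    C = ref_feat.shape[0]
    # Dot product over the channel axis, normalized by channel count.
    return np.einsum('chw,dchw->dhw', ref_feat, warped_src) / C


# Toy check: similarity peaks at the hypothesis whose warp aligns the features.
ref = np.ones((8, 4, 4))
srcs = np.zeros((3, 8, 4, 4))
srcs[1] = ref  # pretend depth hypothesis 1 brings the source into alignment
vol = cost_volume(ref, srcs)
best = vol.argmax(axis=0)  # every pixel selects hypothesis 1
```

A network head would then regress depth (and here, the semantic Gaussian parameters) from this volume instead of taking a hard argmax, but the argmax makes the geometric intuition concrete.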