SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields

📅 2025-06-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing feed-forward 3D scene understanding methods suffer from coarse semantic representations, low-fidelity geometric reconstruction, and significant semantic noise, while per-scene optimization alternatives rely on dense view inputs, driving up deployment costs. This paper proposes SemanticSplat, an end-to-end, sparse-view framework for holistic 3D scene understanding that unifies geometric, appearance, and language-level semantic modeling. Key contributions: (1) a semantic-aware anisotropic Gaussian field representation; (2) multi-source semantic feature fusion via a cross-view cost volume; and (3) a two-stage implicit semantic field distillation mechanism enabling open-vocabulary, promptable 3D segmentation. Experiments demonstrate substantial improvements over baselines such as LSM on both promptable and open-vocabulary 3D segmentation benchmarks: the method achieves finer-grained geometry reconstruction, markedly reduces semantic noise, and supports real-time augmented-reality interaction.
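To make the representation concrete, here is a minimal PyTorch sketch of what a semantic anisotropic Gaussian parameterization could look like: the usual 3D Gaussian parameters (mean, per-axis scale, rotation, opacity, color) extended with a latent semantic vector per Gaussian. The `SemanticGaussians` container, its field names, and the feature dimension D are illustrative assumptions, not the paper's exact layout.

```python
# Illustrative sketch only: a plausible container for semantic-aware
# anisotropic Gaussians, not SemanticSplat's actual parameterization.
from dataclasses import dataclass
import torch

@dataclass
class SemanticGaussians:
    means: torch.Tensor      # (N, 3) centers in world space
    scales: torch.Tensor     # (N, 3) per-axis scales -> anisotropic shape
    rotations: torch.Tensor  # (N, 4) unit quaternions (w, x, y, z)
    opacities: torch.Tensor  # (N, 1) alpha values for compositing
    colors: torch.Tensor     # (N, 3) RGB (real systems often use SH coefficients)
    semantics: torch.Tensor  # (N, D) latent semantic attribute per Gaussian

def covariances(g: SemanticGaussians) -> torch.Tensor:
    """Anisotropic covariance Sigma = R diag(s)^2 R^T for each Gaussian."""
    w, x, y, z = g.rotations.unbind(-1)
    # Standard quaternion -> rotation-matrix conversion, batched over N.
    R = torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y),
        2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)
    M = R @ torch.diag_embed(g.scales)
    return M @ M.transpose(-1, -2)
```

During rasterization, the `semantics` vectors would be alpha-composited per pixel exactly like colors, yielding a renderable feature field alongside RGB.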

📝 Abstract
Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at https://semanticsplat.github.io.
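The cost volume mentioned in the abstract can be pictured as a plane sweep: features from a reference view are correlated with source-view features warped to a set of depth hypotheses, so each cell stores a cross-view feature similarity. The PyTorch sketch below shows one standard way to build such a volume; the camera conventions, the cosine-style similarity, and all tensor shapes are assumptions for illustration, not SemanticSplat's exact design.

```python
# A hedged sketch of a plane-sweep cost volume over cross-view features.
import torch
import torch.nn.functional as F

def plane_sweep_cost_volume(feat_ref, feat_src, K, T_ref2src, depths):
    """Cross-view feature-similarity volume via a plane sweep.

    feat_ref, feat_src: (B, C, H, W) feature maps from a shared encoder.
    K: (B, 3, 3) intrinsics; T_ref2src: (B, 4, 4) reference->source pose.
    depths: iterable of D depth hypotheses.
    Returns: (B, D, H, W) volume of normalized feature correlations.
    """
    B, C, H, W = feat_ref.shape
    dev = feat_ref.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev, dtype=torch.float32),
        torch.arange(W, device=dev, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)  # (3, HW)

    K_inv = torch.linalg.inv(K)
    R, t = T_ref2src[:, :3, :3], T_ref2src[:, :3, 3:]
    fr = F.normalize(feat_ref, dim=1)
    costs = []
    for d in depths:
        # Back-project reference pixels to depth d, then map into source view.
        pts = R @ (K_inv @ pix * d) + t              # (B, 3, HW)
        uv = K @ pts
        uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)   # perspective divide
        # Normalize to [-1, 1] and bilinearly sample source features.
        u = uv[:, 0].reshape(B, H, W) / (W - 1) * 2 - 1
        v = uv[:, 1].reshape(B, H, W) / (H - 1) * 2 - 1
        warped = F.grid_sample(feat_src, torch.stack([u, v], dim=-1),
                               align_corners=True)   # (B, C, H, W)
        # Similarity: channel-wise correlation of normalized features.
        costs.append((fr * F.normalize(warped, dim=1)).sum(dim=1))
    return torch.stack(costs, dim=1)                 # (B, D, H, W)
```

A downstream network can regress depth (and here, semantic Gaussian attributes) from the peak structure of this volume.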
Problem

Research questions and friction points this paper is trying to address.

Feed-forward 3D scene understanding methods extract only limited language-based semantics
Existing methods produce low-quality geometry reconstructions with noisy artifacts
Per-scene optimization methods depend on dense input views, reducing practicality during deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling
Fuses diverse feature fields (e.g., LSeg, SAM) via a cross-view cost volume for coherent scene comprehension
Distills a holistic multi-modal semantic feature field in two stages, enabling promptable and open-vocabulary segmentation (see the sketch below)
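As a rough picture of how such a distilled semantic field might be trained and then queried, the PyTorch sketch below combines a photometric loss with a feature-distillation loss against a frozen 2D teacher (e.g., LSeg), and answers an open-vocabulary prompt by thresholding cosine similarity against a CLIP-aligned text embedding. Everything here is an assumption for illustration: the rasterizer that composites per-Gaussian semantics into `rendered_feat` is hypothetical, and the loss weighting, stage scheduling, and query threshold are not the paper's.

```python
# Hedged sketch of two training signals and an open-vocabulary query.
import torch
import torch.nn.functional as F

def distillation_loss(rendered_feat, teacher_feat):
    """Align rendered semantic features with a frozen 2D teacher (e.g., LSeg).
    rendered_feat, teacher_feat: (B, D, H, W)."""
    r = F.normalize(rendered_feat, dim=1)
    t = F.normalize(teacher_feat, dim=1)
    return (1.0 - (r * t).sum(dim=1)).mean()  # mean cosine distance

def training_step(rendered_rgb, gt_rgb, rendered_feat, teacher_feat,
                  lambda_feat=0.5):
    # Photometric supervision plus feature distillation. The paper's two
    # stages would presumably differ in which teacher/loss is active.
    l_photo = F.l1_loss(rendered_rgb, gt_rgb)
    l_feat = distillation_loss(rendered_feat, teacher_feat)
    return l_photo + lambda_feat * l_feat

@torch.no_grad()
def open_vocab_segmentation(rendered_feat, text_embedding, threshold=0.5):
    """Query a rendered feature map with a CLIP-aligned text embedding.
    rendered_feat: (D, H, W); text_embedding: (D,). Returns an (H, W) mask."""
    sim = F.normalize(rendered_feat, dim=0).permute(1, 2, 0) \
          @ F.normalize(text_embedding, dim=0)
    return sim > threshold  # binary mask for the prompted concept
```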