AI Summary
Existing feedforward 3D scene understanding methods suffer from coarse semantic representations, low-fidelity geometric reconstruction, significant semantic noise, and reliance on dense view inputs, leading to high deployment costs. This paper proposes an end-to-end, sparse-view-driven framework for holistic 3D scene understanding, unifying geometric, appearance, and language-level semantic modeling. Key contributions include: (1) a semantic-aware anisotropic Gaussian field representation; (2) multi-source semantic feature fusion via a cross-view cost volume; and (3) a two-stage implicit semantic field distillation mechanism enabling open-vocabulary, promptable 3D segmentation. Experiments demonstrate substantial improvements over baselines such as LSM on both promptable and open-vocabulary 3D segmentation benchmarks. Our method achieves finer-grained geometry reconstruction, markedly reduces semantic noise, and supports real-time augmented reality interaction.
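The "semantic-aware anisotropic Gaussian field" in contribution (1) augments each standard 3D Gaussian primitive (mean, anisotropic covariance, opacity, color) with a latent semantic vector. A minimal sketch of that representation is below; all field names and dimensions are illustrative assumptions, not taken from the paper, and a full model would use spherical-harmonic color rather than plain RGB.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SemanticGaussian:
    """One anisotropic 3D Gaussian augmented with a latent semantic vector.

    Illustrative sketch: names and dimensions are assumptions, not the
    paper's actual parameterization.
    """
    mean: np.ndarray       # (3,) center in world space
    scale: np.ndarray      # (3,) per-axis standard deviations (anisotropy)
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z) orienting the ellipsoid
    opacity: float         # alpha used during splatting
    color: np.ndarray      # (3,) RGB (a full model would use SH coefficients)
    semantics: np.ndarray  # (D,) latent feature distilled from 2D encoders

    def covariance(self) -> np.ndarray:
        """Sigma = R diag(scale^2) R^T, the usual 3DGS covariance factorization."""
        w, x, y, z = self.rotation / np.linalg.norm(self.rotation)
        R = np.array([
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
            [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
        ])
        return R @ np.diag(self.scale ** 2) @ R.T


# With an identity rotation, the covariance is just diag(scale^2).
g = SemanticGaussian(
    mean=np.zeros(3),
    scale=np.array([0.1, 0.2, 0.05]),
    rotation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity quaternion
    opacity=0.9,
    color=np.array([0.5, 0.5, 0.5]),
    semantics=np.zeros(512),
)
cov = g.covariance()
```

Rendering then splats `semantics` alongside color, so a 2D query (a text embedding or a click prompt) can be matched against the rasterized feature map.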
Abstract
Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at https://semanticsplat.github.io.
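The cost volume the abstract mentions stores cross-view feature similarities: source-view features are warped to the reference view at a set of depth hypotheses, and each hypothesis is scored by how well the warped features correlate with the reference features. A minimal correlation-style sketch, with the homography warping step omitted and all names assumed for illustration:

```python
import numpy as np


def cost_volume(ref_feat: np.ndarray, warped_src: np.ndarray) -> np.ndarray:
    """Correlation-style cost volume (illustrative sketch, not the paper's exact design).

    ref_feat:   (C, H, W)    reference-view feature map
    warped_src: (D, C, H, W) source-view features already warped to the
                reference view at D depth hypotheses (warping omitted here)
    returns:    (D, H, W)    per-pixel, per-depth feature similarity
    """
    C = ref_feat.shape[0]
    # Dot product over the channel axis, normalized by channel count.
    return np.einsum('chw,dchw->dhw', ref_feat, warped_src) / C


# Toy check: similarity peaks at the hypothesis whose warp aligns the features.
ref = np.ones((8, 4, 4))
srcs = np.zeros((3, 8, 4, 4))
srcs[1] = ref  # pretend depth hypothesis 1 brings the source into alignment
vol = cost_volume(ref, srcs)
best = vol.argmax(axis=0)  # every pixel selects hypothesis 1
```

A network head would then regress depth (and here, the semantic Gaussian parameters) from this volume instead of taking a hard argmax, but the argmax makes the geometric intuition concrete.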