AI Summary
3D Gaussian Splatting (3DGS) lacks fine-grained semantic understanding and physical executability, hindering its application in Vision-and-Language Navigation (VLN). Method: We propose SAGE-3D, the first semantically and physically aligned, executable 3DGS environment for VLN. It integrates object-level semantic annotation, physically grounded collision modeling of 3D Gaussian point clouds, and object-centric semantic grounding to jointly optimize semantic comprehension and embodied interaction. Contributions/Results: (1) We introduce InteriorGS, a large-scale dataset of 1K object-annotated indoor scenes rendered via 3DGS; (2) we release SAGE-Bench, the first 3DGS-based VLN benchmark; (3) on the VLN-CE Unseen task, SAGE-3D achieves a 31% relative improvement over state-of-the-art baselines, demonstrating strong zero-shot generalization and real-world transfer potential.
Abstract
3D Gaussian Splatting (3DGS), a 3D representation with photorealistic real-time rendering, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks the fine-grained semantics and physical executability required for Vision-and-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds fine-grained object-level annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, a dataset of 1K object-annotated 3DGS indoor scenes, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark, with 2M VLN samples. Experiments show that training on 3DGS scene data converges more slowly yet generalizes strongly, improving baseline performance by 31% on the VLN-CE Unseen task. The data and code will be released soon.