🤖 AI Summary
This work addresses the limitations in vision-and-language navigation stemming from inadequate scene understanding and poor cross-environment generalization due to the neglect of complex 3D geometry and open-set semantics. To overcome these challenges, the study introduces differentiable 3D Gaussian representations into the task for the first time, initializing Gaussian primitives from pseudo-LiDAR point clouds and proposing an open-set semantic grouping mechanism that jointly models both known and unknown objects. Building upon this representation, the method constructs an online-generated 3D Gaussian map that integrates multi-granularity spatial-semantic cues to guide navigation decisions, complemented by a multi-level action prediction strategy to enhance robustness. Extensive experiments on the R2R, R4R, and REVERIE benchmarks demonstrate substantial performance gains over existing approaches, validating the effectiveness of the proposed framework in both navigation accuracy and generalization capability.
📝 Abstract
Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.