🤖 AI Summary
This work addresses the limitations of traditional vision-and-language navigation (VLN), which relies on egocentric, step-by-step decision-making and therefore suffers from error accumulation and inefficiency. Existing map-based approaches either score discrete path proposals or incrementally update memory graphs, both of which hinder continuous spatial reasoning. To overcome these issues, the authors propose Top-Down VLN (TD-VLN), a paradigm that reformulates navigation as a one-step global path planning problem on a pre-built top-down map, and they introduce the R2R-TopDown dataset to support this setting. They present NavOne, an end-to-end framework that predicts dense path probabilities over the map in a single forward pass, avoiding the discrete bottleneck of earlier map-based methods. Experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods.
📝 Abstract
Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which suffers from error accumulation and limited efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), which reformulates navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To address this task, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation and extends Attention Residuals for spatially aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.
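
The abstract describes the output as dense path probabilities over a top-down map but does not say how a concrete route is read off that map. The sketch below shows one generic way this could be done and is not NavOne's decoding procedure: it treats the map as a 2D grid of per-cell probabilities and runs Dijkstra's algorithm with a cost of -log(p) for entering each cell, so the returned route maximizes the product of cell probabilities. The function name `decode_path`, the 4-connected grid, and the known start and goal cells are all assumptions made for illustration.

```python
import heapq
import math


def decode_path(prob, start, goal, eps=1e-9):
    """Illustrative only (not from the paper): return the 4-connected route
    from start to goal that minimizes the sum of -log(p) over entered cells.

    prob  : list of rows, each a list of per-cell path probabilities in [0, 1]
    start : (row, col) of the agent's cell (assumed known)
    goal  : (row, col) of the target cell (assumed known)
    """
    rows, cols = len(prob), len(prob[0])
    dist = {start: 0.0}   # best known cost to reach each cell
    prev = {}             # back-pointers for path reconstruction
    heap = [(0.0, start)]

    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist.get(cell, math.inf):
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                # Low-probability cells are expensive; eps avoids log(0).
                nd = d - math.log(max(prob[nr][nc], eps))
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))

    if goal not in dist:
        return []  # goal lies outside the grid

    # Walk the back-pointers from goal to start, then reverse.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy 3x3 map with a high-probability corridor down the middle column.
toy_map = [
    [0.9, 0.9, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.9, 0.9],
]
route = decode_path(toy_map, start=(0, 0), goal=(2, 2))
```

On the toy map, the route follows the high-probability corridor rather than the geometrically equivalent routes through low-probability cells. A search of this kind runs once per instruction on the predicted map, which is consistent with the one-step planning framing, though the actual decoding step used with NavOne may differ.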