SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation

πŸ“… 2026-03-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the unreliability of vision-and-language navigation in complex environments caused by insufficient spatial awareness. The authors propose an end-to-end foundation model that learns generalizable 3D spatial priors directly from RGB video streams and explicitly injects spatial cues into action reasoning via a single compact token. By integrating a chain-of-thought-like mechanism and jointly leveraging occupancy prediction with multi-task co-training on a vision-language architecture, the model achieves cross-scene generalization even for tasks lacking explicit spatial supervision. The approach attains state-of-the-art performance across three benchmarks encompassing diverse indoor and outdoor scenes, and real-world experiments demonstrate strong generalization and reliability.
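The single-token spatial injection described above can be pictured with a short sketch. The PyTorch module below is illustrative only: the class name, dimensions, and attention-pooling design are assumptions, not details taken from the paper. It pools per-frame spatial features from an occupancy branch into one learned token and prepends that token to the language model's input sequence before action reasoning.

```python
import torch
import torch.nn as nn

class SpatialTokenInjector(nn.Module):
    """Hypothetical sketch: compress spatial features into a single token
    and prepend it to the VLM's action-reasoning sequence."""

    def __init__(self, spatial_dim: int = 256, lm_dim: int = 4096):
        super().__init__()
        # Learned query that attends over spatial features to form one token.
        self.query = nn.Parameter(torch.randn(1, 1, spatial_dim))
        self.pool = nn.MultiheadAttention(spatial_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(spatial_dim, lm_dim)  # map into the LM embedding space

    def forward(self, spatial_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # spatial_feats: (B, N, spatial_dim) features from the occupancy branch
        # text_embeds:   (B, T, lm_dim) embedded instruction/observation tokens
        B = spatial_feats.size(0)
        q = self.query.expand(B, -1, -1)
        token, _ = self.pool(q, spatial_feats, spatial_feats)  # (B, 1, spatial_dim)
        spatial_token = self.proj(token)                       # (B, 1, lm_dim)
        # Prepend the compact spatial cue, CoT-style, before action reasoning.
        return torch.cat([spatial_token, text_embeds], dim=1)
```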

πŸ“ Abstract
Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex environments remains challenging due to insufficient spatial awareness. In this work, we introduce SPAN-Nav, an end-to-end foundation model designed to infuse embodied navigation with universal 3D spatial awareness using RGB video streams. SPAN-Nav extracts spatial priors across diverse scenes through an occupancy prediction task on extensive indoor and outdoor environments. To mitigate the computational burden, we introduce a compact representation for spatial priors, finding that a single token is sufficient to encapsulate the coarse-grained cues essential for navigation tasks. Furthermore, inspired by the Chain-of-Thought (CoT) mechanism, SPAN-Nav utilizes this single spatial token to explicitly inject spatial cues into action reasoning through an end-to-end framework. Leveraging multi-task co-training, SPAN-Nav captures task-adaptive cues from generalized spatial priors, enabling robust spatial awareness to generalize even to tasks lacking explicit spatial supervision. To support comprehensive spatial learning, we present a massive dataset of 4.2 million occupancy annotations covering both indoor and outdoor scenes across multiple types of navigation tasks. SPAN-Nav achieves state-of-the-art performance across three benchmarks spanning diverse scenarios and varied navigation tasks. Finally, real-world experiments validate the robust generalization and practical reliability of our approach in complex physical scenarios.
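As a rough illustration of the multi-task co-training objective mentioned in the abstract, the sketch below combines an action-prediction loss with an auxiliary occupancy loss that is simply skipped for batches from tasks without spatial labels. The function name, the loss forms, and the `occ_weight` hyperparameter are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def co_training_loss(action_logits, action_targets,
                     occ_logits=None, occ_targets=None, occ_weight=0.5):
    """Hypothetical joint objective: always supervise actions; add an
    auxiliary per-voxel occupancy term only when spatial labels exist."""
    # action_logits: (B, num_actions); action_targets: (B,) class indices
    loss = F.cross_entropy(action_logits, action_targets)
    if occ_logits is not None and occ_targets is not None:
        # Binary occupancy prediction as auxiliary spatial supervision.
        loss = loss + occ_weight * F.binary_cross_entropy_with_logits(
            occ_logits, occ_targets)
    return loss
```

Under this kind of objective, tasks lacking occupancy annotations still train the shared backbone through the action loss, which is one plausible reading of how spatial awareness could transfer to unsupervised tasks.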
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
Spatial Awareness
Embodied Navigation
Path Planning
3D Spatial Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Awareness
Vision-Language Navigation
Occupancy Prediction
Compact Representation
Multi-Task Co-Training
👥 Authors

Jiahang Liu (Peking University)
Tianyu Xu (Peking University)
Jiawei Chen (Peking University)
Lu Yue (Peking University)
Jiazhao Zhang (Peking University) · Embodied AI, Navigation, 3D Vision
Zhiyong Wang (Galbot)
Minghan Li (Galbot)
Qisheng Zhao (Peking University)
Anqi Li (Peking University)
Qi Su (Peking University) · Computational Linguistics, Corpus Linguistics, Digital Humanities
Zhizheng Zhang (BAAI)
He Wang (Assistant Professor of Computer Science, Peking University) · Embodied AI, Computer Vision, Robotics