SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation

πŸ“… 2026-03-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the unreliability of vision-and-language navigation in complex environments caused by insufficient spatial awareness. The authors propose an end-to-end foundation model that learns generalizable 3D spatial priors directly from RGB video streams and explicitly injects spatial cues into action reasoning via a single compact token. By integrating a chain-of-thought-like mechanism and jointly leveraging occupancy prediction with multi-task co-training on a vision-language architecture, the model achieves cross-scene generalization even for tasks lacking explicit spatial supervision. The approach attains state-of-the-art performance across three benchmarks encompassing diverse indoor and outdoor scenes, and real-world experiments demonstrate strong generalization and reliability.
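The single-token spatial injection described above can be pictured with a short sketch. The PyTorch module below is illustrative only: the class name, dimensions, and attention-pooling design are assumptions, not details taken from the paper. It pools per-frame spatial features from an occupancy branch into one learned token and prepends that token to the language model's input sequence before action reasoning.

```python
import torch
import torch.nn as nn

class SpatialTokenInjector(nn.Module):
    """Hypothetical sketch: compress spatial features into a single token
    and prepend it to the VLM's action-reasoning sequence."""

    def __init__(self, spatial_dim: int = 256, lm_dim: int = 4096):
        super().__init__()
        # Learned query that attends over spatial features to form one token.
        self.query = nn.Parameter(torch.randn(1, 1, spatial_dim))
        self.pool = nn.MultiheadAttention(spatial_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(spatial_dim, lm_dim)  # map into the LM embedding space

    def forward(self, spatial_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # spatial_feats: (B, N, spatial_dim) features from the occupancy branch
        # text_embeds:   (B, T, lm_dim) embedded instruction/observation tokens
        B = spatial_feats.size(0)
        q = self.query.expand(B, -1, -1)
        token, _ = self.pool(q, spatial_feats, spatial_feats)  # (B, 1, spatial_dim)
        spatial_token = self.proj(token)                       # (B, 1, lm_dim)
        # Prepend the compact spatial cue, CoT-style, before action reasoning.
        return torch.cat([spatial_token, text_embeds], dim=1)
```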

πŸ“ Abstract
Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex environments remains challenging due to insufficient spatial awareness. In this work, we introduce SPAN-Nav, an end-to-end foundation model designed to infuse embodied navigation with universal 3D spatial awareness using RGB video streams. SPAN-Nav extracts spatial priors across diverse scenes through an occupancy prediction task on extensive indoor and outdoor environments. To mitigate the computational burden, we introduce a compact representation for spatial priors, finding that a single token is sufficient to encapsulate the coarse-grained cues essential for navigation tasks. Furthermore, inspired by the Chain-of-Thought (CoT) mechanism, SPAN-Nav utilizes this single spatial token to explicitly inject spatial cues into action reasoning through an end-to-end framework. Leveraging multi-task co-training, SPAN-Nav captures task-adaptive cues from generalized spatial priors, enabling robust spatial awareness to generalize even to tasks lacking explicit spatial supervision. To support comprehensive spatial learning, we present a massive dataset of 4.2 million occupancy annotations covering both indoor and outdoor scenes across multiple types of navigation tasks. SPAN-Nav achieves state-of-the-art performance across three benchmarks spanning diverse scenarios and varied navigation tasks. Finally, real-world experiments validate the robust generalization and practical reliability of our approach in complex physical scenarios.
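As a rough illustration of the multi-task co-training objective mentioned in the abstract, the sketch below combines an action-prediction loss with an auxiliary occupancy loss that is simply skipped for batches from tasks without spatial labels. The function name, the loss forms, and the `occ_weight` hyperparameter are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def co_training_loss(action_logits, action_targets,
                     occ_logits=None, occ_targets=None, occ_weight=0.5):
    """Hypothetical joint objective: always supervise actions; add an
    auxiliary per-voxel occupancy term only when spatial labels exist."""
    # action_logits: (B, num_actions); action_targets: (B,) class indices
    loss = F.cross_entropy(action_logits, action_targets)
    if occ_logits is not None and occ_targets is not None:
        # Binary occupancy prediction as auxiliary spatial supervision.
        loss = loss + occ_weight * F.binary_cross_entropy_with_logits(
            occ_logits, occ_targets)
    return loss
```

Under this kind of objective, tasks lacking occupancy annotations still train the shared backbone through the action loss, which is one plausible reading of how spatial awareness could transfer to unsupervised tasks.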
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
Spatial Awareness
Embodied Navigation
Path Planning
3D Spatial Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Awareness
Vision-Language Navigation
Occupancy Prediction
Compact Representation
Multi-Task Co-Training
👥 Authors

Jiahang Liu (Peking University)
Tianyu Xu (Peking University)
Jiawei Chen (Peking University)
Lu Yue (Peking University)
Jiazhao Zhang (Peking University) · Embodied AI, Navigation, 3D Vision
Zhiyong Wang (Galbot)
Minghan Li (Galbot)
Qisheng Zhao (Peking University)
Anqi Li (Peking University)
Qi Su (Peking University) · Computational Linguistics, Corpus Linguistics, Digital Humanities
Zhizheng Zhang (BAAI)
He Wang (Assistant Professor of Computer Science, Peking University) · Embodied AI, Computer Vision, Robotics