🤖 AI Summary
Existing navigation benchmarks predominantly emphasize semantic understanding and lack systematic evaluation of spatial perception and reasoning capabilities. To address this gap, we propose NavSpace, a novel benchmark comprising six categories of spatial reasoning tasks and 1,228 trajectory-instruction pairs, establishing the first comprehensive evaluation framework for spatial intelligence in embodied navigation. We also introduce SNav, a new spatially intelligent navigation model that integrates multimodal large language models with an explicit spatial reasoning architecture. A comprehensive evaluation of 22 navigation agents on NavSpace, including state-of-the-art navigation models, demonstrates SNav's superior performance; its generalizability and practicality are further validated on real robotic platforms. This work bridges critical gaps in both the assessment and modeling of spatial intelligence for embodied navigation, uncovering fundamental challenges, including geometric understanding, topological reasoning, and dynamic spatial alignment, that remain central to advancing autonomous navigation systems.
📝 Abstract
Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematic evaluation of navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results reveal the current state of spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and in real robot tests, establishing a strong baseline for future work.