City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) evaluations predominantly focus on language-centric tasks or simulated environments, lacking rigorous assessment of real-world, knowledge-intensive sequential decision-making, particularly embodied visual navigation in urban settings. Method: We introduce the task of Sparsely Grounded Visual Navigation and operationalize it with CityNav, a benchmark spanning four global cities that requires MLLMs to perform end-to-end city navigation across 50+ decision points solely from visual inputs, without external annotations, fine-tuning, or architectural modifications. We propose Verbalization of Path (VoP), a prompting strategy that explicitly elicits cognitive-map generation (i.e., key landmarks and directional relations toward the destination) to ground spatial reasoning and enhance interpretability. Contribution/Results: VoP significantly improves navigation success rates across mainstream MLLMs on CityNav, outperforming Chain-of-Thought and Reflection baselines. This work provides the first empirical evidence that MLLMs can emergently acquire embodied navigation capabilities from web-scale pretraining, enabling zero-shot, vision-only urban wayfinding.

📝 Abstract
Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' sequential decision-making in real-world navigation without environmental annotations
Assessing multimodal reasoning for city navigation using only visual inputs and internal knowledge
Addressing performance gaps in knowledge-intensive navigation tasks through explicit cognitive mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Verbalization of Path (VoP) for explicit cognitive mapping
Uses raw visual inputs and internal multimodal reasoning for navigation
Evaluates agents in real-world city environments without annotations
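The VoP idea described above (elicit a cognitive map of landmarks and directions before committing to a move) can be sketched as a per-step prompt template. This is a minimal illustrative sketch only; the function names, prompt wording, and action set are assumptions, not the paper's released code.

```python
# Hypothetical sketch of a Verbalization-of-Path (VoP) style prompt:
# before each move, the agent is asked to verbalize a cognitive map
# (key landmarks and the destination's direction relative to them)
# and only then commit to an action. All names here are illustrative.

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]

def build_vop_prompt(destination: str, step: int) -> str:
    """Compose a single-step navigation prompt that elicits a cognitive
    map before the action choice (the street-view image is assumed to be
    attached separately as the multimodal input)."""
    return (
        f"You are navigating a real city toward: {destination}. "
        f"This is decision point {step}.\n"
        "First, verbalize your cognitive map: list the key landmarks "
        "visible in the current view and state the direction of the "
        "destination relative to each of them.\n"
        "Then choose exactly one action from "
        f"{ACTIONS} and output it on the final line as ACTION: <name>."
    )

def parse_action(response: str) -> str:
    """Extract the chosen action from the model's final ACTION: line;
    fall back to 'stop' if no valid action is found."""
    for line in reversed(response.strip().splitlines()):
        if line.startswith("ACTION:"):
            candidate = line.split(":", 1)[1].strip()
            if candidate in ACTIONS:
                return candidate
    return "stop"
```

In an actual agent loop, `build_vop_prompt` would be sent together with the current view to the MLLM at every decision point, and `parse_action` would drive the simulator or street-view stepper.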