🤖 AI Summary
Current vision-language models (VLMs) excel at navigation that follows explicit instructions but struggle to infer implicit human needs (e.g., “I’m thirsty”) in urban embodied settings. To address this gap, we introduce CitySeeker, the first benchmark for implicit-demand urban embodied navigation, covering 8 cities, 6,440 trajectories, and 7 target scenario types. It systematically exposes three fundamental bottlenecks of VLMs in long-horizon implicit navigation: error accumulation, inadequate spatial cognition, and deficient experiential recall. We propose BCR, an exploration strategy inspired by human spatial cognition that incorporates backtracking, spatial-cognition augmentation, and memory-augmented retrieval, and we design a VLM evaluation framework integrating spatial modeling and trajectory simulation. Experiments show that state-of-the-art VLMs (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task success; BCR significantly mitigates error accumulation and spatial reasoning deficits. Our work establishes a cognitively grounded, deployable pathway for “last-mile” implicit-demand navigation.
📝 Abstract
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark for assessing VLMs' spatial reasoning and decision-making in embodied urban navigation driven by implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We identify three key bottlenecks: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze them further, we investigate a series of exploratory strategies (Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval, collectively BCR), inspired by the emphasis in human cognitive mapping on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required for tackling "last-mile" navigation challenges.