CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) excel at explicit instruction-following navigation but struggle to infer implicit human needs (e.g., "I'm thirsty") in urban embodied settings. To address this gap, we introduce CitySeeker, the first benchmark for implicit-demand urban embodied navigation, covering 8 cities, 6,440 trajectories, and 7 goal-driven scenario types; it systematically exposes three fundamental bottlenecks of VLMs in long-horizon implicit navigation. We propose BCR (Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval), a set of exploration strategies inspired by human spatial cognition, and design a VLM evaluation framework integrating spatial modeling and trajectory simulation. Experiments show that even state-of-the-art VLMs (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion; BCR significantly mitigates error accumulation and spatial reasoning deficits. Our work establishes a cognitively grounded, deployable pathway for "last-mile" implicit-demand navigation.

📝 Abstract
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We identify key bottlenecks: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required for tackling "last-mile" navigation challenges.
Problem

Research questions and friction points this paper is trying to address.

Assesses VLMs' spatial reasoning for implicit human needs
Introduces CitySeeker benchmark with urban trajectories and scenarios
Identifies bottlenecks in long-horizon reasoning and spatial cognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CitySeeker benchmark for implicit needs navigation
Proposes BCR strategies for spatial reasoning enhancement
Addresses error accumulation and spatial cognition bottlenecks
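The BCR idea (backtracking, spatial cognition, memory-based retrieval) can be sketched with a toy example. Everything below is illustrative and hypothetical, not the paper's implementation: the street graph, node names, and the greedy frontier choice are all assumptions. A visited set stands in for memory-based retrieval, neighbor filtering stands in for spatial cognition, and popping the current path stands in for the backtracking mechanism when a dead end is reached.

```python
# Toy street graph: node -> reachable neighbors. The goal node is where the
# implicit need (e.g. "I'm thirsty" -> a store) would be satisfied.
# All names are hypothetical and for illustration only.
GRAPH = {
    "plaza": ["cafe_street", "park"],
    "cafe_street": ["plaza", "dead_end"],
    "dead_end": ["cafe_street"],
    "park": ["plaza", "store"],
    "store": ["park"],
}

def bcr_explore(graph, start, goal, max_steps=20):
    """Illustrative BCR-style loop: explore unvisited neighbors first,
    remember everywhere already seen, and backtrack out of dead ends."""
    memory = {start}        # memory-based retrieval: visited locations
    path = [start]          # current trajectory; popping = backtracking
    for _ in range(max_steps):
        node = path[-1]
        if node == goal:
            return path     # trajectory after backtracked detours are removed
        # "spatial cognition" stand-in: consider only unvisited neighbors
        frontier = [n for n in graph[node] if n not in memory]
        if frontier:
            nxt = frontier[0]
            memory.add(nxt)
            path.append(nxt)
        elif len(path) > 1:
            path.pop()      # backtracking mechanism: retreat from dead ends
        else:
            return None     # exhausted the graph without reaching the goal
    return None             # step budget exceeded

print(bcr_explore(GRAPH, "plaza", "store"))
```

In a real VLM agent, the frontier choice would come from the model's reasoning over street-view observations rather than a fixed ordering, but the control flow (observe, decide, remember, backtrack) is the same shape.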
👥 Authors
Siqi Wang
Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
Chao Liang
Nanjing Institute of Geography and Limnology, Chinese Academy of Sciences, Nanjing, China
Yunfan Gao
Tongji University, Shanghai, China
Erxin Yu
The Hong Kong Polytechnic University
Sen Li
Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
Yushi Li
Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
Jing Li
Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
Haofen Wang
Tongji University
Knowledge Graph · Natural Language Processing · Retrieval Augmented Generation