CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?

📅 2025-12-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) excel at navigation from explicit instructions but struggle to infer implicit human needs (e.g., “I’m thirsty”) in urban embodied settings. To address this gap, we introduce CitySeeker, the first benchmark for implicit-demand urban embodied navigation, covering 8 cities, 6,440 trajectories, and 7 target scenario types. It systematically exposes three fundamental bottlenecks of VLMs in long-horizon implicit navigation: error accumulation, inadequate spatial cognition, and deficient experiential recall. We propose BCR, an exploration strategy inspired by human spatial cognition that combines backtracking, spatial-cognition augmentation, and memory-augmented retrieval, and we design a VLM evaluation framework integrating spatial modeling and trajectory simulation. Experiments show that state-of-the-art VLMs (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task success; BCR significantly mitigates error accumulation and spatial reasoning deficits. Our work establishes a cognitively grounded, deployable pathway for “last-mile” implicit-demand navigation.

📝 Abstract

Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities in embodied urban navigation that addresses implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required for tackling "last-mile" navigation challenges.

Problem

Research questions and friction points this paper is trying to address.

Assesses VLMs' spatial reasoning for implicit human needs

Introduces CitySeeker benchmark with urban trajectories and scenarios

Identifies bottlenecks in long-horizon reasoning and spatial cognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CitySeeker benchmark for implicit needs navigation

Proposes BCR strategies for spatial reasoning enhancement

Addresses error accumulation and spatial cognition bottlenecks
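
To make the backtracking idea concrete, here is a minimal Python sketch. It is not the paper's BCR implementation. It assumes the street network is a small adjacency-list graph, and it replaces the VLM's judgment with a hypothetical `score` function that rates how promising a node looks. The function name `explore_with_backtracking`, the toy graph, and the scores are all invented for illustration.

```python
from typing import Callable, Dict, List, Optional

Graph = Dict[str, List[str]]


def explore_with_backtracking(
    graph: Graph,
    start: str,
    is_goal: Callable[[str], bool],
    score: Callable[[str], float],
    max_steps: int = 50,
) -> Optional[List[str]]:
    """Greedy exploration that steps back when the current node is exhausted.

    `score` is a hypothetical stand-in for a VLM's preference over candidate
    next nodes; it is an assumption of this sketch, not part of CitySeeker.
    """
    path = [start]
    visited = {start}
    for _ in range(max_steps):
        current = path[-1]
        if is_goal(current):
            return path
        # Only consider neighbors that have not been visited yet.
        candidates = [n for n in graph.get(current, []) if n not in visited]
        if candidates:
            nxt = max(candidates, key=score)
            visited.add(nxt)
            path.append(nxt)
        elif len(path) > 1:
            path.pop()  # dead end: backtrack to the previous node
        else:
            return None  # nothing left to explore from the start node
    return path if is_goal(path[-1]) else None


# Toy example: the greedy choice ("alley") leads to a dead end,
# so the agent has to backtrack before it can reach the cafe.
toy_graph: Graph = {
    "start": ["alley", "main_st"],
    "alley": ["dead_end"],
    "dead_end": [],
    "main_st": ["cafe"],
    "cafe": [],
}
toy_scores = {"alley": 0.9, "main_st": 0.5, "dead_end": 0.4, "cafe": 0.8}

route = explore_with_backtracking(
    toy_graph,
    "start",
    is_goal=lambda node: node == "cafe",
    score=lambda node: toy_scores.get(node, 0.0),
)
print(route)
```

In this toy run the agent first commits to the higher-scored alley, hits the dead end, and recovers by stepping back to the start. Without that recovery step, the early mistake would carry through the rest of the trajectory, which is the kind of error accumulation the summary describes.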

🔎 Similar Papers

Advances in Embodied Navigation Using Large Language Models: A Survey