CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments

📅 2026-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of systematic evaluation of cross-view spatial reasoning in vision-language models operating in open urban environments. To this end, we introduce CityCube, a novel benchmark that establishes the first comprehensive evaluation framework spanning five cognitive dimensions and three types of spatial-relation expression. CityCube integrates multi-platform perspectives (ground vehicles, drones, and satellites) and incorporates four distinct camera-motion patterns. Based on 5,022 meticulously annotated multi-view question-answer pairs, we conduct a systematic assessment of 33 vision-language models. Our findings reveal that even large-scale models struggle to exceed 54.1% accuracy, lagging human performance by 34.2 percentage points. Notably, small-scale models fine-tuned on this domain surpass 60.0% accuracy, underscoring the importance of tailored training for spatial reasoning in urban contexts.

📝 Abstract
Cross-view spatial reasoning is essential for embodied AI, underpinning spatial understanding, mental simulation and planning in complex environments. Existing benchmarks primarily emphasize indoor or street settings, overlooking the unique challenges of open-ended urban spaces characterized by rich semantics, complex geometries, and view variations. To address this, we introduce CityCube, a systematic benchmark designed to probe cross-view reasoning capabilities of current VLMs in urban settings. CityCube integrates four viewpoint dynamics to mimic camera movements and spans a wide spectrum of perspectives from multiple platforms, e.g., vehicles, drones and satellites. For a comprehensive assessment, it features 5,022 meticulously annotated multi-view QA pairs categorized into five cognitive dimensions and three spatial relation expressions. A comprehensive evaluation of 33 VLMs reveals a significant performance disparity with humans: even large-scale models struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. By contrast, small-scale fine-tuned VLMs achieve over 60.0% accuracy, highlighting the necessity of our benchmark. Further analyses indicate the task correlations and fundamental cognitive disparity between VLMs and human-like reasoning.
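For readers who want a concrete picture of the evaluation protocol, the sketch below shows one way a multiple-choice, multi-view QA benchmark such as CityCube can be scored. The dataset filename, the per-sample JSON schema, and the query_vlm stub are illustrative assumptions, not artifacts released with the paper.

```python
# Minimal sketch of scoring a multiple-choice, multi-view QA benchmark.
# Assumptions (not from the paper's release): the JSON filename, the
# per-sample schema {"images", "question", "options", "answer"}, and
# query_vlm, which stands in for any real VLM call.
import json


def query_vlm(images: list[str], question: str, options: list[str]) -> str:
    """Stand-in for a real model call; should return an option letter like 'A'."""
    raise NotImplementedError("plug in an actual VLM (API or local) here")


def evaluate(dataset_path: str) -> float:
    """Return multiple-choice accuracy over the annotated QA pairs."""
    with open(dataset_path, encoding="utf-8") as f:
        samples = json.load(f)  # assumed: a list of QA dicts

    correct = 0
    for sample in samples:
        # Each question is conditioned on several views of the same scene
        # (e.g., vehicle, drone, and satellite imagery).
        pred = query_vlm(sample["images"], sample["question"], sample["options"])
        correct += pred.strip().upper() == sample["answer"].strip().upper()
    return correct / len(samples)


if __name__ == "__main__":
    acc = evaluate("citycube_qa.json")  # hypothetical filename
    print(f"Accuracy: {acc:.1%}")
```

Under this protocol, accuracy is simply the fraction of questions answered correctly, which is how the 54.1% (off-the-shelf models) and 60.0% (fine-tuned models) figures in the abstract should be read.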
Problem

Research questions and friction points this paper is trying to address.

cross-view spatial reasoning
urban environments
vision-language models
benchmark
spatial understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-view spatial reasoning
vision-language models
urban benchmark
multi-platform viewpoints
spatial cognition
Haotian Xu
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Yue Hu
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Zhengqiu Zhu
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Chen Gao
BNRist, Tsinghua University
Data Mining · LLM Agent · Embodied AI
Ziyou Wang
Department of Electronic Engineering, Tsinghua University
Junreng Rao
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Wenhao Lu
Microsoft
AI · ML · CV · NLP
Weishi Li
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Quanjun Yin
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Yong Li
Professor, Electronic Engineering, Tsinghua University
Urban Science · Data Mining · AI for Science