CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments

📅 2026-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of systematic evaluation of cross-view spatial reasoning in vision-language models operating in open urban environments. To this end, we introduce CityCube, a novel benchmark that establishes the first comprehensive evaluation framework spanning five cognitive dimensions and three types of spatial-relation expression. CityCube integrates multi-platform perspectives (ground vehicles, drones, and satellites) and incorporates four distinct camera-motion patterns. Based on 5,022 meticulously annotated multi-view question-answer pairs, we conduct a systematic assessment of 33 vision-language models. Our findings reveal that even large-scale models struggle to exceed 54.1% accuracy, lagging human performance by 34.2 percentage points. Notably, small-scale models fine-tuned on this domain surpass 60.0% accuracy, underscoring the importance of tailored training for spatial reasoning in urban contexts.

📝 Abstract
Cross-view spatial reasoning is essential for embodied AI, underpinning spatial understanding, mental simulation and planning in complex environments. Existing benchmarks primarily emphasize indoor or street settings, overlooking the unique challenges of open-ended urban spaces characterized by rich semantics, complex geometries, and view variations. To address this, we introduce CityCube, a systematic benchmark designed to probe cross-view reasoning capabilities of current VLMs in urban settings. CityCube integrates four viewpoint dynamics to mimic camera movements and spans a wide spectrum of perspectives from multiple platforms, e.g., vehicles, drones and satellites. For a comprehensive assessment, it features 5,022 meticulously annotated multi-view QA pairs categorized into five cognitive dimensions and three spatial relation expressions. A comprehensive evaluation of 33 VLMs reveals a significant performance disparity with humans: even large-scale models struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. By contrast, small-scale fine-tuned VLMs achieve over 60.0% accuracy, highlighting the necessity of our benchmark. Further analyses indicate the task correlations and fundamental cognitive disparity between VLMs and human-like reasoning.
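For readers who want a concrete picture of the evaluation protocol, the sketch below shows one way a multiple-choice, multi-view QA benchmark such as CityCube can be scored. The dataset filename, the per-sample JSON schema, and the query_vlm stub are illustrative assumptions, not artifacts released with the paper.

```python
# Minimal sketch of scoring a multiple-choice, multi-view QA benchmark.
# Assumptions (not from the paper's release): the JSON filename, the
# per-sample schema {"images", "question", "options", "answer"}, and
# query_vlm, which stands in for any real VLM call.
import json


def query_vlm(images: list[str], question: str, options: list[str]) -> str:
    """Stand-in for a real model call; should return an option letter like 'A'."""
    raise NotImplementedError("plug in an actual VLM (API or local) here")


def evaluate(dataset_path: str) -> float:
    """Return multiple-choice accuracy over the annotated QA pairs."""
    with open(dataset_path, encoding="utf-8") as f:
        samples = json.load(f)  # assumed: a list of QA dicts

    correct = 0
    for sample in samples:
        # Each question is conditioned on several views of the same scene
        # (e.g., vehicle, drone, and satellite imagery).
        pred = query_vlm(sample["images"], sample["question"], sample["options"])
        correct += pred.strip().upper() == sample["answer"].strip().upper()
    return correct / len(samples)


if __name__ == "__main__":
    acc = evaluate("citycube_qa.json")  # hypothetical filename
    print(f"Accuracy: {acc:.1%}")
```

Under this protocol, accuracy is simply the fraction of questions answered correctly, which is how the 54.1% (off-the-shelf models) and 60.0% (fine-tuned models) figures in the abstract should be read.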
Problem

Research questions and friction points this paper is trying to address.

cross-view spatial reasoning
urban environments
vision-language models
benchmark
spatial understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-view spatial reasoning
vision-language models
urban benchmark
multi-platform viewpoints
spatial cognition
Haotian Xu
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Yue Hu
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Zhengqiu Zhu
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Chen Gao
BNRist, Tsinghua University
Data Mining · LLM Agent · Embodied AI
Ziyou Wang
Department of Electronic Engineering, Tsinghua University
Junreng Rao
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Wenhao Lu
Microsoft
AI · ML · CV · NLP
Weishi Li
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Quanjun Yin
College of Systems Engineering, National University of Defense Technology; State Key Laboratory of Digital Intelligent Modeling and Simulation
Yong Li
Professor, Electronic Engineering, Tsinghua University
Urban Science · Data Mining · AI for Science