How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

📅 2025-08-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
General-purpose vision-language models (VLMs) exhibit limited capability in urban street-scene spatial reasoning (e.g., object localization, layout understanding, and depth inference), due to the scarcity of real-world annotated data. Method: We formally introduce "urban spatial reasoning" as a novel task and construct the first synthetic VQA dataset for it. Leveraging segmentation, depth, and detection annotations derived from street-scene images, we generate Chain-of-Thought (CoT)-enabled synthetic question-answer pairs, explicitly covering challenging negation and counterfactual reasoning. Contribution/Results: Fine-tuning BLIP-2, InstructBLIP, and LLaVA-1.5 solely on this synthetic dataset yields substantial gains over their zero-shot baselines: average accuracy on complex spatial reasoning queries improves by 12.6–18.3%. These results demonstrate that synthetic CoT-augmented data effectively enhances VLMs' spatial understanding in urban environments and exhibits strong generalization to real-world reasoning tasks.
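
The generation recipe described above invites a small illustration. The sketch below shows how a depth-comparison question, its CoT answer, and a negation variant could be templated from detector and depth-map outputs. The `DetectedObject` schema, the wording of the templates, and pooling a median depth inside each box are assumptions for illustration, not the authors' released pipeline.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    """One detected object, with depth pooled from the depth map
    inside its bounding box (hypothetical annotation schema)."""
    label: str             # e.g. "pedestrian", from the object detector
    median_depth_m: float  # metres, from the monocular depth prediction

def depth_comparison_qa(a: DetectedObject, b: DetectedObject,
                        negate: bool = False) -> dict:
    """Template one depth-comparison question with a CoT answer.

    Setting negate=True produces the negation variant the paper
    highlights as a challenging question type.
    """
    a_closer = a.median_depth_m < b.median_depth_m
    answer = a_closer != negate  # negating the question flips the label
    neg = "not " if negate else ""
    question = (f"Is the {a.label} {neg}closer to the camera "
                f"than the {b.label}?")
    cot = (f"The {a.label} is estimated at {a.median_depth_m:.1f} m and "
           f"the {b.label} at {b.median_depth_m:.1f} m. A smaller depth "
           f"means closer to the camera, so the {a.label} "
           f"{'is' if a_closer else 'is not'} closer. Therefore the "
           f"answer is {'yes' if answer else 'no'}.")
    return {"question": question, "cot_answer": cot,
            "answer": "yes" if answer else "no"}

# Example: a pedestrian at 4.2 m vs. a car at 11.8 m, negated question.
pair = depth_comparison_qa(DetectedObject("pedestrian", 4.2),
                           DetectedObject("car", 11.8), negate=True)
print(pair["question"])  # Is the pedestrian not closer to the camera than the car?
print(pair["answer"])    # no
```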

📝 Abstract
Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with an LLM-generated Chain-of-Thought (CoT) answer for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
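
As a point of reference for the zero-shot setting, a baseline query like those in the study can be run in a few lines. This is a minimal sketch assuming the public Hugging Face checkpoint `llava-hf/llava-1.5-7b-hf`; the paper does not state its inference stack, so the prompt and decoding settings here are illustrative, not the authors' code.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("street_view.jpg")  # any street-view frame
prompt = ("USER: <image>\nIs the pedestrian on the left closer to the "
          "camera than the bus? Think step by step, then answer yes "
          "or no. ASSISTANT:")

# BatchFeature.to casts only floating tensors, so input_ids stay integer.
inputs = processor(images=image, text=prompt,
                   return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```
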
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' spatial reasoning on urban street-view images (see the scoring sketch after this list)
Assessing zero-shot transfer of pretrained models to urban domains
Improving VLM performance with synthetic fine-tuning datasets
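
For the first point, a scorer that reports accuracy per question type, so that negation and counterfactual queries can be tracked separately, might look like the following sketch. The record fields (`qtype`, `prediction`, `answer`) are a hypothetical format, not the paper's.

```python
from collections import defaultdict

def accuracy_by_question_type(records):
    """Accuracy broken down by question type (e.g. localization, depth,
    negation, counterfactual), so hard types are reported separately.

    `records` uses a hypothetical format: each item carries 'qtype',
    'prediction', and 'answer', with short answers like yes/no.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["qtype"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["qtype"]] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

# Example with two stub records:
print(accuracy_by_question_type([
    {"qtype": "negation", "prediction": "No", "answer": "no"},
    {"qtype": "depth", "prediction": "yes", "answer": "no"},
]))  # {'negation': 1.0, 'depth': 0.0}
```
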
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning VLMs with a synthetic urban VQA dataset (see the fine-tuning sketch after this list)
Using Chain-of-Thought answers for reasoning supervision
Leveraging segmentation, depth, and object detection predictions
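
The first two points, fine-tuning with CoT answers as the supervision targets, could be realized with parameter-efficient adaptation. Below is a minimal LoRA sketch using the `peft` library on LLaVA-1.5's attention projections; the paper reports fine-tuning but not its recipe, so the ranks, target modules, learning rate, and the loss simplification noted in the comments are all assumptions.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Adapt only the attention projections; everything else stays frozen.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"]))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(image, question, cot_answer):
    """One supervised step: the CoT answer is the target sequence."""
    text = f"USER: <image>\n{question} ASSISTANT: {cot_answer}"
    inputs = processor(images=image, text=text,
                       return_tensors="pt").to(model.device, torch.bfloat16)
    # Simplest possible loss: teacher-force the whole sequence. A real
    # pipeline would set prompt and image-token positions in the labels
    # to -100 so the loss covers only the CoT answer.
    out = model(**inputs, labels=inputs["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

Keeping the base weights frozen makes it cheap to adapt each of the three VLMs to the same synthetic dataset without full retraining.
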
Authors
Juneyoung Ro, Korea Advanced Institute of Science and Technology
Namwoo Kim, Ph.D. candidate, Korea Advanced Institute of Science and Technology (Spatial Analysis, Urban Mobility, Transportation)
Yoonjin Yoon, Korea Advanced Institute of Science and Technology