How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

📅 2025-08-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
General-purpose vision-language models (VLMs) exhibit limited capability in urban street-scene spatial reasoning (e.g., object localization, layout understanding, and depth inference), due to the scarcity of real-world annotated data. Method: We formally introduce "urban spatial reasoning" as a novel task and construct the first synthetic VQA dataset for it. Leveraging segmentation, depth, and detection annotations derived from street-scene images, we generate Chain-of-Thought (CoT)-enabled synthetic question-answer pairs, explicitly covering challenging negation and counterfactual reasoning. Contribution/Results: Fine-tuning BLIP-2, InstructBLIP, and LLaVA-1.5 solely on this synthetic dataset yields substantial gains over their zero-shot baselines: average accuracy on complex spatial reasoning queries improves by 12.6–18.3%. These results demonstrate that synthetic CoT-augmented data effectively enhances VLMs' spatial understanding in urban environments and exhibits strong generalization to real-world reasoning tasks.
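
The generation recipe described above invites a small illustration. The sketch below shows how a depth-comparison question, its CoT answer, and a negation variant could be templated from detector and depth-map outputs. The `DetectedObject` schema, the wording of the templates, and pooling a median depth inside each box are assumptions for illustration, not the authors' released pipeline.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    """One detected object, with depth pooled from the depth map
    inside its bounding box (hypothetical annotation schema)."""
    label: str             # e.g. "pedestrian", from the object detector
    median_depth_m: float  # metres, from the monocular depth prediction

def depth_comparison_qa(a: DetectedObject, b: DetectedObject,
                        negate: bool = False) -> dict:
    """Template one depth-comparison question with a CoT answer.

    Setting negate=True produces the negation variant the paper
    highlights as a challenging question type.
    """
    a_closer = a.median_depth_m < b.median_depth_m
    answer = a_closer != negate  # negating the question flips the label
    neg = "not " if negate else ""
    question = (f"Is the {a.label} {neg}closer to the camera "
                f"than the {b.label}?")
    cot = (f"The {a.label} is estimated at {a.median_depth_m:.1f} m and "
           f"the {b.label} at {b.median_depth_m:.1f} m. A smaller depth "
           f"means closer to the camera, so the {a.label} "
           f"{'is' if a_closer else 'is not'} closer. Therefore the "
           f"answer is {'yes' if answer else 'no'}.")
    return {"question": question, "cot_answer": cot,
            "answer": "yes" if answer else "no"}

# Example: a pedestrian at 4.2 m vs. a car at 11.8 m, negated question.
pair = depth_comparison_qa(DetectedObject("pedestrian", 4.2),
                           DetectedObject("car", 11.8), negate=True)
print(pair["question"])  # Is the pedestrian not closer to the camera than the car?
print(pair["answer"])    # no
```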

📝 Abstract
Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with an LLM-generated Chain-of-Thought (CoT) answer for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
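
As a point of reference for the zero-shot setting, a baseline query like those in the study can be run in a few lines. This is a minimal sketch assuming the public Hugging Face checkpoint `llava-hf/llava-1.5-7b-hf`; the paper does not state its inference stack, so the prompt and decoding settings here are illustrative, not the authors' code.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("street_view.jpg")  # any street-view frame
prompt = ("USER: <image>\nIs the pedestrian on the left closer to the "
          "camera than the bus? Think step by step, then answer yes "
          "or no. ASSISTANT:")

# BatchFeature.to casts only floating tensors, so input_ids stay integer.
inputs = processor(images=image, text=prompt,
                   return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```
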
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' spatial reasoning on urban street-view images (see the scoring sketch after this list)
Assessing zero-shot transfer of pretrained models to urban domains
Improving VLM performance with synthetic fine-tuning datasets
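
For the first point, a scorer that reports accuracy per question type, so that negation and counterfactual queries can be tracked separately, might look like the following sketch. The record fields (`qtype`, `prediction`, `answer`) are a hypothetical format, not the paper's.

```python
from collections import defaultdict

def accuracy_by_question_type(records):
    """Accuracy broken down by question type (e.g. localization, depth,
    negation, counterfactual), so hard types are reported separately.

    `records` uses a hypothetical format: each item carries 'qtype',
    'prediction', and 'answer', with short answers like yes/no.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["qtype"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["qtype"]] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

# Example with two stub records:
print(accuracy_by_question_type([
    {"qtype": "negation", "prediction": "No", "answer": "no"},
    {"qtype": "depth", "prediction": "yes", "answer": "no"},
]))  # {'negation': 1.0, 'depth': 0.0}
```
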
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning VLMs with a synthetic urban VQA dataset (see the fine-tuning sketch after this list)
Using Chain-of-Thought answers for reasoning supervision
Leveraging segmentation, depth, and object detection predictions
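
The first two points, fine-tuning with CoT answers as the supervision targets, could be realized with parameter-efficient adaptation. Below is a minimal LoRA sketch using the `peft` library on LLaVA-1.5's attention projections; the paper reports fine-tuning but not its recipe, so the ranks, target modules, learning rate, and the loss simplification noted in the comments are all assumptions.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Adapt only the attention projections; everything else stays frozen.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"]))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(image, question, cot_answer):
    """One supervised step: the CoT answer is the target sequence."""
    text = f"USER: <image>\n{question} ASSISTANT: {cot_answer}"
    inputs = processor(images=image, text=text,
                       return_tensors="pt").to(model.device, torch.bfloat16)
    # Simplest possible loss: teacher-force the whole sequence. A real
    # pipeline would set prompt and image-token positions in the labels
    # to -100 so the loss covers only the CoT answer.
    out = model(**inputs, labels=inputs["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

Keeping the base weights frozen makes it cheap to adapt each of the three VLMs to the same synthetic dataset without full retraining.
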
Authors
Juneyoung Ro, Korea Advanced Institute of Science and Technology
Namwoo Kim, Ph.D. candidate, Korea Advanced Institute of Science and Technology (Spatial Analysis, Urban Mobility, Transportation)
Yoonjin Yoon, Korea Advanced Institute of Science and Technology