Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether vision-language models (VLMs) approximate human perception of urban street scenes, specifically their capacity to model both objective physical attributes and subjective impressions (e.g., “safety”, “vitality”). To this end, the authors introduce Montreal-100, presented as the first city perception benchmark to pair multi-group subjective annotations with objective attribute labels. Seven VLMs are evaluated zero-shot with structured prompting and deterministic parsing; performance is measured via accuracy, macro F1 (top system: 0.31), and Jaccard overlap on multi-label items (top system: 0.48), with Krippendorff’s alpha quantifying inter-annotator agreement. Results show that current VLMs substantially underperform humans on subjective impression prediction, that model scores track human annotation consistency, and that synthetic imagery induces only marginal degradation. The benchmark, prompts, and evaluation harness are released to support uncertainty-aware, reproducible evaluation in participatory urban research.
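
As a concrete illustration, here is a minimal sketch of how the headline metrics could be computed, assuming single-choice answers are compared as strings and multi-label answers as sets; all labels below are hypothetical, and scikit-learn supplies `f1_score`:

```python
from sklearn.metrics import f1_score

def jaccard(pred: set, gold: set) -> float:
    """Jaccard overlap |pred ∩ gold| / |pred ∪ gold| for one multi-label item."""
    if not pred and not gold:
        return 1.0  # both empty: treat as perfect agreement
    return len(pred & gold) / len(pred | gold)

# Single-choice items: accuracy and macro F1 over class labels (hypothetical labels).
gold = ["residential", "commercial", "residential"]
pred = ["residential", "residential", "residential"]
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
macro_f1 = f1_score(gold, pred, average="macro")

# Multi-label items: mean Jaccard across items (hypothetical label sets).
gold_sets = [{"trees", "benches"}, {"signage"}]
pred_sets = [{"trees"}, {"signage", "lighting"}]
mean_jaccard = sum(jaccard(p, g) for p, g in zip(pred_sets, gold_sets)) / len(gold_sets)
```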

📝 Abstract
Understanding how people read city scenes can inform design and planning. We introduce a small benchmark for testing vision-language models (VLMs) on urban perception using 100 Montreal street images, evenly split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups supplied 230 annotation forms across 30 dimensions mixing physical attributes and subjective impressions. French responses were normalized to English. We evaluate seven VLMs in a zero-shot setup with a structured prompt and deterministic parser. We use accuracy for single-choice items and Jaccard overlap for multi-label items; human agreement uses Krippendorff's alpha and pairwise Jaccard. Results suggest stronger model alignment on visible, objective properties than on subjective appraisals. The top system (claude-sonnet) reaches a macro F1 of 0.31 and a mean Jaccard of 0.48 on multi-label items. Higher human agreement coincides with better model scores. Synthetic images slightly lower the scores. We release the benchmark, prompts, and harness for reproducible, uncertainty-aware evaluation in participatory urban analysis.
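
The agreement statistics are standard and could be reproduced along these lines; this sketch assumes the third-party krippendorff package and integer-coded nominal categories, and the annotation matrix is hypothetical:

```python
from itertools import combinations

import numpy as np
import krippendorff  # third-party: pip install krippendorff

# Rows = annotators, columns = items; np.nan marks items an annotator skipped.
# Nominal categories are integer-coded (e.g., 0 = "unsafe", 1 = "neutral", 2 = "safe").
reliability_data = np.array([
    [0, 1, 2, np.nan, 1],
    [0, 1, 2, 2, np.nan],
    [1, 1, 2, 2, 1],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")

# Pairwise Jaccard for one multi-label item: mean overlap across annotator pairs.
def pairwise_jaccard(label_sets: list[set]) -> float:
    def j(a, b):
        return 1.0 if not a and not b else len(a & b) / len(a | b)
    pairs = list(combinations(label_sets, 2))
    return sum(j(a, b) for a, b in pairs) / len(pairs)

print(alpha, pairwise_jaccard([{"trees"}, {"trees", "benches"}, {"trees"}]))
```
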
Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-language models on urban perception tasks
Assessing model alignment with human subjective appraisals
Comparing performance on objective versus subjective urban attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Urban perception benchmark using street images
Evaluated seven VLMs zero-shot with structured prompts and a deterministic parser (see the sketch after this list)
Combined physical attributes with subjective impressions
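
The released harness contains the paper's actual prompt template and parser. Purely as an illustration, here is a minimal sketch of structured prompting with deterministic parsing for one single-choice dimension; the dimension name, option list, and JSON convention are assumptions, not the authors' design:

```python
import json
import re

# Hypothetical single-choice dimension and options.
DIMENSION = "perceived_safety"
OPTIONS = ["very unsafe", "unsafe", "neutral", "safe", "very safe"]

PROMPT = (
    "You will see one street-level image of Montreal.\n"
    f"Rate the dimension '{DIMENSION}'.\n"
    f"Answer with exactly one option from: {OPTIONS}.\n"
    'Respond only as JSON: {"answer": "<option>"}'
)

def parse_response(raw: str) -> str | None:
    """Deterministic parser: extract a JSON object and validate the chosen option."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None  # no JSON object found: scored as unparseable
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    answer = obj.get("answer") if isinstance(obj, dict) else None
    if isinstance(answer, str) and answer.strip().lower() in OPTIONS:
        return answer.strip().lower()
    return None  # out-of-vocabulary answers count as misses

print(parse_response('Sure! {"answer": "Safe"}'))  # -> "safe"
```

Restricting answers to a fixed option list and rejecting anything else is what keeps scoring deterministic and comparable across models.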