UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes Leveraging Vision Large Language Models

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Urban cultural and architectural characteristics vary across geography, history, and sociopolitical contexts, yet traditional analyses rely on expert interpretation, lacking standardization and scalability. Method: We propose UrbanSense—the first quantitative street-scene analysis framework based on vision-language models (VLMs)—and introduce UrbanDiffBench, the first benchmark dataset enabling automated, cross-city and cross-temporal stylistic modeling and evolutionary comparison. Our approach integrates multimodal representation learning, style embedding generation, statistical significance testing (two-sample t-test), and human-centered evaluation via Phi coefficient. Contribution/Results: Experiments show that over 80% of generated stylistic descriptions achieve statistical significance (p < 0.05); Phi coefficients reach 0.912 (inter-city) and 0.833 (inter-temporal), confirming robust stylistic discrimination and interpretability. This work establishes a reproducible, scalable, and quantitative paradigm for urban cultural research.

📝 Abstract
Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method's ability to capture subtle stylistic differences. These results highlight the method's potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.
Problem

Research questions and friction points this paper is trying to address.

Analyzing urban streetscape differences across cities objectively
Standardizing urban cultural studies with automated scalable methods
Quantifying urban style evolution using vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language models for urban streetscape analysis
Automated quantitative urban style representation
High accuracy in capturing stylistic differences
Jun Yin
Tsinghua University
Jing Zhong
Tsinghua University
Peilin Li
National University of Singapore
Machine Learning · Architecture · Generative Design
Pengyu Zeng
Tsinghua University
Artificial Intelligence · Deep Learning
Miao Zhang
Tsinghua University
Ran Luo
South China University of Technology
Shuai Lu
Tsinghua University