Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda

📅 2025-10-14

🤖 AI Summary

Problem: Monitoring urban public infrastructure is difficult because the monitored objects, environments, and contextual conditions are highly heterogeneous; existing approaches that combine IoT sensors with manual inspection are costly, hard to scale, and often misaligned with how citizens visually perceive infrastructure. Method: The paper is a systematic review of vision-language models (VLMs) for urban monitoring, with particular emphasis on zero-shot applications. Following the PRISMA methodology, it analyzes 32 peer-reviewed studies (2021–2025) to characterize the monitoring tasks addressed, the prevailing architectures, the data resources, and the evaluation protocols for VLMs in urban settings. Contribution/Results: The review positions zero-shot VLMs, which assess the state of different objects across tasks without fine-tuning, as a promising and potentially lower-cost, more scalable route to urban monitoring that is better aligned with citizen perception. As the title indicates, the review is paired with an evaluation and a research agenda.

📝 Abstract

Urban monitoring of public infrastructure (such as waste bins, road signs, vegetation, sidewalks, and construction sites) poses significant challenges due to the diversity of objects, environments, and contextual conditions involved. Current state-of-the-art approaches typically rely on a combination of IoT sensors and manual inspections, which are costly, difficult to scale, and often misaligned with citizens' perception formed through direct visual observation. This raises a critical question: Can machines now "see" like citizens and infer informed opinions about the condition of urban infrastructure? Vision-Language Models (VLMs), which integrate visual understanding with natural language reasoning, have recently demonstrated impressive capabilities in processing complex visual information, making them a promising technology to address this challenge. This systematic review investigates the role of VLMs in urban monitoring, with particular emphasis on zero-shot applications. Following the PRISMA methodology, we analyzed 32 peer-reviewed studies published between 2021 and 2025 to address four core research questions: (1) What urban monitoring tasks have been effectively addressed using VLMs? (2) Which VLM architectures and frameworks are most commonly used and demonstrate superior performance? (3) What datasets and resources support this emerging field? (4) How are VLM-based applications evaluated, and what performance levels have been reported?

Problem

Research questions and friction points this paper is trying to address.

Applying vision-language models to urban infrastructure monitoring

Evaluating zero-shot VLM applications in diverse urban environments

Assessing VLM architectures and datasets for public infrastructure analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models for urban monitoring tasks

Zero-shot applications without task-specific training

Integrating visual understanding with natural language reasoning
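
To make the zero-shot idea above concrete, here is a minimal sketch of how a single infrastructure photo might be assessed with a prompt alone, without any task-specific training. It is not taken from the paper: the label set `CONDITIONS`, the helper names, and the `query_vlm(image_path, prompt)` callable are all hypothetical, and a stub stands in for a real model so the example runs offline.

```python
from typing import Callable, Sequence

# Hypothetical label set; the source does not define one.
CONDITIONS = ("good", "needs_attention", "critical")


def build_prompt(object_type: str, labels: Sequence[str] = CONDITIONS) -> str:
    """Compose a zero-shot instruction: no examples, no fine-tuning."""
    options = ", ".join(labels)
    return (
        f"You are inspecting a photo of a {object_type} in a city street. "
        f"Answer with exactly one of: {options}."
    )


def parse_label(reply: str, labels: Sequence[str] = CONDITIONS) -> str:
    """Map a free-text model reply onto the label set, or 'unknown'."""
    text = reply.strip().lower().replace(" ", "_")
    # Check longer labels first so a label nested in another is not matched early.
    for label in sorted(labels, key=len, reverse=True):
        if label in text:
            return label
    return "unknown"


def assess(
    image_path: str,
    object_type: str,
    query_vlm: Callable[[str, str], str],
) -> str:
    """Run one zero-shot assessment; query_vlm is a placeholder for a model call."""
    return parse_label(query_vlm(image_path, build_prompt(object_type)))


def fake_vlm(image_path: str, prompt: str) -> str:
    # Stand-in for a real vision-language model so the sketch runs offline.
    return "Needs attention"


print(assess("bin_001.jpg", "waste bin", fake_vlm))
```

Switching to a different object (a road sign, a sidewalk) only changes the `object_type` argument, which is the sense in which the approach is cross-object; swapping `fake_vlm` for a real model client is left to the reader.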

🔎 Similar Papers

UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction