UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction

📅 2024-03-25
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing urban forecasting models rely heavily on satellite imagery while neglecting fine-grained structural details (e.g., building morphology), and text descriptions generated by vision–language models such as UrbanCLIP suffer from low fidelity, undermining the reliability of socioeconomic indicator prediction. To address these limitations, we propose a multi-granularity vision–language joint pretraining framework that synergistically integrates macro-scale (satellite) and micro-scale (street-view) imagery. Our method introduces an LLM-driven controllable text generation module coupled with confidence calibration to mitigate hallucination and textual homogenization in large language models. It comprises multi-granularity visual encoding, dual-objective (contrastive and generative) image–text alignment, and end-to-end joint fine-tuning. Evaluated on six urban socioeconomic prediction tasks, our approach consistently outperforms state-of-the-art methods, achieving substantial average performance gains while significantly improving textual description quality and prediction robustness.
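The dual-objective (contrastive and generative) image–text alignment mentioned above can be sketched as an InfoNCE-style contrastive term plus a captioning cross-entropy term. This is an illustrative NumPy sketch, not the paper's implementation; the function names and the weighting factor `alpha` are assumptions.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over matched image/text embedding pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(logits))             # matched pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # stabilize softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def caption_nll(token_logits, target_ids):
    """Generative objective: next-token cross-entropy for text generation."""
    l = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def dual_objective(img_emb, txt_emb, token_logits, target_ids, alpha=1.0):
    """Combined alignment loss; alpha weights the generative term (assumed)."""
    return info_nce(img_emb, txt_emb) + alpha * caption_nll(token_logits, target_ids)
```

In a multi-granularity setup, `img_emb` would come from fusing macro (satellite) and micro (street-view) visual encoders before alignment with the calibrated text descriptions.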

📝 Abstract
Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced micro-level detail such as the architecture of individual places. Secondly, the text generated by the precursor work UrbanCLIP, which fully utilizes the extensive knowledge of LLMs, frequently exhibits issues such as hallucination and homogenization, undermining its reliability. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, providing a robust guarantee for producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six socioeconomic indicator prediction tasks underscore its superior performance.
Problem

Research questions and friction points this paper is trying to address.

Satellite Image Detail
UrbanCLIP Model Accuracy
Urban Prediction Reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale Image Integration
Textual Data Processing
Urban Condition Prediction