UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction

📅 2024-03-25
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing urban forecasting models rely heavily on satellite imagery while neglecting fine-grained structural details (e.g., building morphology), and text descriptions generated by vision–language models such as UrbanCLIP suffer from low fidelity, undermining the reliability of socioeconomic indicator prediction. To address these limitations, we propose a multi-granularity vision–language joint pretraining framework that synergistically integrates macro-scale (satellite) and micro-scale (street-view) imagery. Our method introduces an LLM-driven controllable text generation module coupled with confidence calibration to mitigate hallucination and textual homogenization in large language models. It comprises multi-granularity visual encoding, dual-objective (contrastive and generative) image–text alignment, and end-to-end joint fine-tuning. Evaluated on six urban socioeconomic prediction tasks, our approach consistently outperforms state-of-the-art methods, achieving substantial average performance gains while significantly improving textual description quality and prediction robustness.
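The dual-objective (contrastive and generative) image–text alignment mentioned above can be sketched as an InfoNCE-style contrastive term plus a captioning cross-entropy term. This is an illustrative NumPy sketch, not the paper's implementation; the function names and the weighting factor `alpha` are assumptions.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over matched image/text embedding pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(logits))             # matched pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # stabilize softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def caption_nll(token_logits, target_ids):
    """Generative objective: next-token cross-entropy for text generation."""
    l = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def dual_objective(img_emb, txt_emb, token_logits, target_ids, alpha=1.0):
    """Combined alignment loss; alpha weights the generative term (assumed)."""
    return info_nce(img_emb, txt_emb) + alpha * caption_nll(token_logits, target_ids)
```

In a multi-granularity setup, `img_emb` would come from fusing macro (satellite) and micro (street-view) visual encoders before alignment with the calibrated text descriptions.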

📝 Abstract
Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced micro-level detail such as the architecture of individual places. Secondly, the text generated by the precursor work UrbanCLIP, which fully utilizes the extensive knowledge of LLMs, frequently exhibits issues such as hallucination and homogenization, undermining its reliability. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, providing a robust guarantee for producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six socioeconomic indicator prediction tasks underscore its superior performance.
Problem

Research questions and friction points this paper is trying to address.

Satellite Image Detail
UrbanCLIP Model Accuracy
Urban Prediction Reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale Image Integration
Textual Data Processing
Urban Condition Prediction