CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

📅 2025-10-25
🤖 AI Summary
Existing methods that use large vision-language models (LVLMs) to predict urban socioeconomic indicators from street-level and satellite imagery generalize poorly and offer little transparency into their reasoning. This paper introduces the first reinforcement learning-based framework for urban socioeconomic reasoning. It uses a semantically aware, verifiable reward function to guide LVLMs toward discriminative visual cues, enabling structured, goal-directed cross-modal reasoning. It requires no human-annotated reasoning traces and relies solely on image-indicator pairs for end-to-end optimization. Experiments across multiple cities and unseen socioeconomic indicators demonstrate significant improvements in prediction accuracy, cross-domain generalization, and interpretability of decision rationales. The proposed framework establishes a general, robust, and auditable paradigm for urban sensing.

📝 Abstract
Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE with emergent reasoning process significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
Problem

Research questions and friction points this paper is trying to address.

Improving urban socio-economic prediction accuracy from visual data
Enabling interpretable reasoning in vision-language models via reinforcement learning
Enhancing generalization for socio-economic status across diverse urban contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning optimizes vision-language models
Reward design focuses on meaningful visual cues
Emergent reasoning improves accuracy and generalization
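
To make the idea of a verifiable reward concrete, here is a minimal sketch of how a rule-based reward could score a model completion against a known indicator value. This page does not describe CityRiSE's actual reward, so everything in the block is an assumption made for illustration: the `<answer>` tag format, the `verifiable_reward` name, the `scale` parameter, and the linear error shaping.

```python
import re

# Hypothetical output convention: the model ends its reasoning with
# <answer>NUMBER</answer>. This tag format is an assumption, not the paper's.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>")


def verifiable_reward(completion: str, target: float, scale: float = 10.0) -> float:
    """Score one sampled completion against a ground-truth indicator value.

    Returns a value in [0, 1]: 1.0 for an exact match, decreasing linearly
    with the absolute error, and 0.0 when no answer can be parsed.
    """
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # unparseable output earns no reward
    predicted = float(match.group(1))
    error = abs(predicted - target) / scale  # error as a fraction of the assumed range
    return max(0.0, 1.0 - error)


if __name__ == "__main__":
    sample = "<think>dense housing, few green spaces</think><answer>7.5</answer>"
    print(verifiable_reward(sample, target=7.5))
    print(verifiable_reward("no structured answer here", target=5.0))
```

Because the score depends only on the parsed answer and the ground-truth label, a reward of this kind needs image-indicator pairs but no annotated reasoning traces, which matches the training setup the summary describes.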
Tianhui Liu
Hong Kong University of Science and Technology (Guangzhou), Tsinghua University
Large Language Model, Urban Science, Spatial Intelligence
Hetian Pang
Department of Electronic Engineering, BNRist, Tsinghua University
Xin Zhang
Department of Electronic Engineering, BNRist, Tsinghua University
Jie Feng
Department of Electronic Engineering, BNRist, Tsinghua University
Yong Li
Department of Electronic Engineering, BNRist, Tsinghua University
Pan Hui
Chair Professor, Nokia Chair in Data Science, FREng & IEEE Fellow (HKUST & University of Helsinki)
Ubiquitous Computing, Mobile Computing, Augmented Reality, Data Science, #UnivHelsinkiCS