UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

📅 2025-06-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Urban multimodal understanding lacks a unified framework: existing approaches are largely confined to unimodal processing or task-specific designs. To address this gap, the paper proposes UrbanLLaVA, a multimodal large language model tailored for urban intelligence. The method comprises three key components: (1) a comprehensive urban instruction dataset covering both unimodal and cross-modal tasks; (2) a multi-stage training paradigm that decouples spatial reasoning enhancement from domain knowledge learning; and (3) an architecture integrating a vision encoder, a large language model, and geospatial priors for joint modeling from local scenes to city-scale context. Evaluated across three metropolitan areas, UrbanLLaVA consistently outperforms leading open-source and commercial multimodal LLMs on unimodal recognition, cross-modal reasoning, and cross-city generalization. The code and dataset are publicly released.
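As a rough illustration of component (3), the following is a minimal PyTorch-style sketch of how a vision encoder, an LLM, and geospatial priors could be fused into a single token sequence. All class, module, and parameter names here (UrbanMLLMSketch, geo_proj, etc.) are hypothetical assumptions, not taken from the released UrbanLLaVA code.

```python
# Minimal sketch of the three-component design described above, assuming a
# LLaVA-style architecture; all names are illustrative, not the paper's code.
import torch
import torch.nn as nn

class UrbanMLLMSketch(nn.Module):
    def __init__(self, vision_encoder, language_model, vis_dim, geo_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g. a ViT over street/satellite views
        self.language_model = language_model            # a pretrained decoder-only LLM
        self.vision_proj = nn.Linear(vis_dim, llm_dim)  # map image tokens into LLM space
        self.geo_proj = nn.Linear(geo_dim, llm_dim)     # map geospatial priors into LLM space

    def forward(self, images, geo_features, text_embeds):
        # Encode images into patch tokens, then project to the LLM embedding size.
        vis_tokens = self.vision_proj(self.vision_encoder(images))   # (B, N, llm_dim)
        # Geospatial priors (e.g. coordinates, POI statistics) become one extra token.
        geo_tokens = self.geo_proj(geo_features).unsqueeze(1)        # (B, 1, llm_dim)
        # Concatenate visual, geospatial, and text tokens; the LLM models them jointly.
        inputs = torch.cat([vis_tokens, geo_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```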

📝 Abstract
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process multiple types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we extend the existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal and complex cross-modal tasks, and shows robust generalization across cities. Source code and data are openly accessible to the research community at https://github.com/tsinghua-fib-lab/UrbanLLaVA.
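To make the multi-stage idea concrete, here is a hypothetical sketch of a decoupled training schedule in the spirit of the abstract. The stage names, dataset mixtures, and trainable-module lists are illustrative assumptions, not the paper's actual recipe.

```python
# Hypothetical decoupled training schedule; stages, data mixtures, and
# hyperparameters are assumptions for illustration, not the paper's recipe.
STAGES = [
    # Stage 1: align visual/geospatial tokens with the (frozen) LLM.
    {"name": "alignment", "data": ["image_caption", "geo_text_pairs"],
     "trainable": ["projector"], "epochs": 1},
    # Stage 2: learn urban domain knowledge from single-modal instructions.
    {"name": "domain_knowledge", "data": ["streetview_qa", "satellite_qa"],
     "trainable": ["projector", "llm"], "epochs": 1},
    # Stage 3: enhance spatial reasoning separately with cross-modal tasks,
    # so it does not interfere with domain-knowledge learning.
    {"name": "spatial_reasoning", "data": ["cross_modal_spatial_qa"],
     "trainable": ["projector", "llm"], "epochs": 1},
]

def run_schedule(model, loaders, train_stage):
    """Run the stages in order; `train_stage` fine-tunes the listed modules."""
    for stage in STAGES:
        batches = [loaders[name] for name in stage["data"]]
        train_stage(model, batches, stage["trainable"], stage["epochs"])
```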
Problem

Research questions and friction points this paper is trying to address.

Lack of a unified framework for multi-modal urban data processing
Need for both spatial reasoning and domain knowledge in urban tasks
Absence of a comprehensive benchmark for MLLMs in urban research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curation of a diverse multi-modal urban instruction dataset (see the sketch after this list)
Multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning
Extended benchmark covering a wide range of urban tasks
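For illustration, a minimal sketch of what single-modal versus cross-modal instruction records could look like; all field names, file names, and answers below are hypothetical, not the released dataset schema.

```python
# Hypothetical instruction records illustrating the single-modal/cross-modal
# split; the schema is an illustrative assumption, not the released dataset.
unimodal_sample = {
    "modality": ["street_view"],
    "image": "sv_000123.jpg",
    "instruction": "What land-use type does this street scene suggest?",
    "answer": "Commercial: ground-floor retail with dense signage.",
}

cross_modal_sample = {
    "modality": ["street_view", "satellite"],
    "images": ["sv_000123.jpg", "sat_tile_52_31.png"],
    "instruction": "Given the street scene and the satellite tile, "
                   "is this location near a major road intersection?",
    "answer": "Yes: the tile shows two arterial roads crossing nearby.",
}
```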
Jie Feng
Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China
Shengyuan Wang
Tsinghua University
Tianhui Liu
Hong Kong University of Science and Technology (Guangzhou), Tsinghua University
Large Language Model, Urban Science, Spatial Intelligence
Yanxin Xi
University of Helsinki
Data Mining
Yong Li
Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China