🤖 AI Summary
Urban multimodal understanding lacks a unified framework, with existing approaches largely confined to unimodal processing or task-specific designs. To address this gap, we propose UrbanLLaVA, a multimodal large language model tailored for urban intelligence. Our method comprises three key components: (1) a comprehensive urban instruction dataset covering both unimodal and cross-modal tasks; (2) a multi-stage training paradigm that enhances spatial reasoning while decoupling it from domain-specific knowledge learning; and (3) an architecture integrating a vision encoder, a large language model, and geospatial priors to enable joint modeling from local scenes to city-scale contexts. Evaluated across three diverse metropolitan areas, UrbanLLaVA consistently outperforms leading open-source and commercial multimodal LLMs on unimodal recognition, cross-modal reasoning, and cross-city generalization. The code and dataset are publicly released.
📝 Abstract
Urban research spans a wide range of scenarios and tasks that require understanding multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) offers a promising opportunity to overcome this limitation. In this paper, we introduce *UrbanLLaVA*, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In *UrbanLLaVA*, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of *UrbanLLaVA* across diverse urban tasks. Finally, we extend an existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that *UrbanLLaVA* outperforms open-source and proprietary MLLMs on both single-modal and complex cross-modal tasks and shows robust generalization across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.