🤖 AI Summary
Urban multimodal understanding lacks a unified framework, with existing approaches largely confined to unimodal processing or task-specific designs. To address this gap, we propose UrbanLLaVA, a multimodal large language model tailored for urban intelligence. Our method comprises three key components: (1) a comprehensive urban instruction dataset covering both unimodal and cross-modal tasks; (2) a multi-stage training paradigm that enhances spatial reasoning while decoupling it from domain-specific knowledge learning; and (3) an architecture integrating a vision encoder, a large language model, and geospatial priors to enable joint modeling from local scenes to city-scale contexts. Evaluated across three diverse metropolitan areas, UrbanLLaVA consistently outperforms leading open-source and commercial multimodal LLMs on unimodal recognition, cross-modal reasoning, and cross-city generalization. The code and dataset are publicly released.
📝 Abstract
Urban research spans a wide range of scenarios and tasks that require understanding multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) offers a promising opportunity to overcome this limitation. In this paper, we introduce *UrbanLLaVA*, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In *UrbanLLaVA*, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of *UrbanLLaVA* across diverse urban tasks. Finally, we extend an existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that *UrbanLLaVA* outperforms open-source and proprietary MLLMs on both single-modal and complex cross-modal tasks and shows robust generalization across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.