UrbanDiT: A Foundation Model for Open-World Urban Spatio-Temporal Learning

📅 2024-11-19
🏛️ arXiv.org
📈 Citations: 13
Influential: 0
📄 PDF
🤖 AI Summary
Modeling spatiotemporal dynamics in open urban environments remains challenging due to heterogeneous data formats (e.g., grids, graphs) and diverse downstream tasks. Method: This paper introduces an urban spatiotemporal foundation model featuring a dual-path prompt learning framework—data-driven and task-adaptive—that unifies multi-source, heterogeneous spatiotemporal inputs. It integrates diffusion Transformers, spatiotemporal serialization encoding, dynamic masking, and hierarchical prompt generation to enable multi-format sequence modeling, joint multi-task training, and zero-shot cross-city/cross-task generalization. Contribution/Results: Evaluated across five urban domains—including transportation traffic, crowd flows, and taxi demand—and multiple cities and tasks, the model achieves state-of-the-art performance. Its zero-shot transfer capability significantly surpasses mainstream supervised baselines, establishing a scalable, general-purpose foundation architecture for urban intelligence.

📝 Abstract
The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse spatio-temporal data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in an elaborate prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications. UrbanDiT offers three primary advantages: 1) it unifies diverse data types, such as grid-based and graph-based data, into a sequential format, allowing it to capture spatio-temporal dynamics across diverse scenarios of different cities; 2) with masking strategies and task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) it generalizes effectively to open-world scenarios, with its powerful zero-shot capabilities outperforming nearly all baselines that had access to training data. These features allow UrbanDiT to achieve state-of-the-art performance in domains such as transportation traffic, crowd flows, taxi demand, bike usage, and cellular traffic, across multiple cities and tasks. UrbanDiT sets a new benchmark for foundation models in the urban spatio-temporal domain.
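The abstract's first claimed advantage — unifying grid-based and graph-based data into one sequential format — can be illustrated with a minimal sketch. The function names and array shapes below are assumptions for illustration, not the paper's actual implementation; the core idea is simply that both formats reduce to a set of per-location time series that a transformer can consume as tokens.

```python
import numpy as np

def serialize_grid(grid):
    """Flatten a (T, H, W) grid into (H*W, T): one time series per cell."""
    T, H, W = grid.shape
    return grid.reshape(T, H * W).T

def serialize_graph(node_signals):
    """A (T, N) graph signal is already N per-node time series: return (N, T)."""
    return node_signals.T

# Both heterogeneous formats end up as (num_locations, num_steps) sequences.
grid = np.random.rand(24, 4, 4)   # e.g., hourly crowd flow on a 4x4 grid
graph = np.random.rand(24, 10)    # e.g., hourly demand at 10 road-network nodes
seq_a, seq_b = serialize_grid(grid), serialize_graph(graph)
print(seq_a.shape, seq_b.shape)   # (16, 24) (10, 24)
```

Once both sources share this layout, a single model can be trained jointly on grid and graph data, which is what enables the multi-data learning the abstract describes.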
Problem

Research questions and friction points this paper is trying to address.

Modeling complex urban spatio-temporal dynamics from human activities
Unifying diverse data sources and types for universal pattern learning
Adapting to open-world scenarios with zero-shot generalization capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales diffusion transformers for urban spatiotemporal learning
Unifies diverse data sources into sequential format
Uses adaptive prompt learning for multi-task applications
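The masking idea behind the multi-task support above can be sketched as follows. This is a simplified illustration with assumed shapes and mask patterns, not the authors' code: each task corresponds to a different binary mask over a (locations × time) observation matrix, and the model learns to reconstruct the hidden entries.

```python
import numpy as np

def task_mask(num_locs, num_steps, task, rng=None):
    """Return a boolean mask (True = observed) for one spatio-temporal task."""
    rng = rng or np.random.default_rng(0)
    mask = np.ones((num_locs, num_steps), dtype=bool)
    if task == "prediction":        # hide the future half of every series
        mask[:, num_steps // 2:] = False
    elif task == "interpolation":   # hide alternating time steps
        mask[:, 1::2] = False
    elif task == "extrapolation":   # hide a subset of locations entirely
        mask[: num_locs // 2, :] = False
    elif task == "imputation":      # hide random scattered entries
        mask &= rng.random((num_locs, num_steps)) > 0.3
    return mask

m = task_mask(8, 12, "prediction")
print(m[:, :6].all(), m[:, 6:].any())  # True False
```

Under this framing, prediction, interpolation, extrapolation, and imputation become one reconstruction objective that differs only in which entries are masked, which is how a single model can be trained jointly on all four tasks.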
Yuan Yuan
Department of Electronic Engineering, Tsinghua University, Beijing, China
Chonghua Han
Tsinghua University
foundation model, spatio-temporal data mining
Jingtao Ding
Tsinghua University
Spatio-temporal Data Mining, Complex Networks, Synthetic Data, Recommender Systems
Depeng Jin
Department of Electronic Engineering, Tsinghua University, Beijing, China
Yong Li
Department of Electronic Engineering, Tsinghua University, Beijing, China