UrbanFM: Scaling Urban Spatio-Temporal Foundation Models

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of generalizable spatio-temporal foundation models in urban computing, where existing approaches are often confined to specific scenarios and struggle to transfer across cities or tasks. To overcome this limitation, we propose UrbanFM, a minimalist self-attention architecture that leverages the billion-scale standardized dataset WorldST and a lightweight spatio-temporal computational unit, MiniST. By integrating data standardization, discrete representations of spatio-temporal fields, and large-scale pretraining, UrbanFM establishes a unified foundation model for urban spatio-temporal learning. Evaluated on EvalST, the largest benchmark to date, UrbanFM achieves zero-shot cross-city generalization for the first time, demonstrating remarkable versatility in modeling unseen cities and tasks without task-specific fine-tuning.

📝 Abstract
Urban systems, as dynamic complex systems, continuously generate spatio-temporal data streams that encode the fundamental laws of human mobility and city evolution. While AI for Science has witnessed the transformative power of foundation models in disciplines like genomics and meteorology, urban computing remains fragmented due to "scenario-specific" models, which are overfitted to specific regions or tasks, hindering their generalizability. To bridge this gap and advance spatio-temporal foundation models for urban systems, we adopt scaling as the central perspective and systematically investigate two key questions: what to scale and how to scale. Grounded in first-principles analysis, we identify three critical dimensions: heterogeneity, correlation, and dynamics, aligning these principles with the fundamental scientific properties of urban spatio-temporal data. Specifically, to address heterogeneity through data scaling, we construct WorldST. This billion-scale corpus standardizes diverse physical signals, such as traffic flow and speed, from over 100 global cities into a unified data format. To enable computation scaling for modeling correlations, we introduce the MiniST unit, a novel split mechanism that discretizes continuous spatio-temporal fields into learnable computational units to unify representations of grid-based and sensor-based observations. Finally, addressing dynamics via architecture scaling, we propose UrbanFM, a minimalist self-attention architecture designed with limited inductive biases to autonomously learn dynamic spatio-temporal dependencies from massive data. Furthermore, we establish EvalST, the largest-scale urban spatio-temporal benchmark to date. Extensive experiments demonstrate that UrbanFM achieves remarkable zero-shot generalization across unseen cities and tasks, marking a pivotal first step toward large-scale urban spatio-temporal foundation models.
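The abstract's pipeline of discretizing a continuous spatio-temporal field into computational units and mixing them with low-inductive-bias self-attention can be illustrated with a minimal sketch. This is an assumption-laden illustration of the general idea, not the paper's actual MiniST or UrbanFM implementation; the function names, patch sizes, and the use of plain unprojected attention are all hypothetical.

```python
import numpy as np

def minist_tokenize(field, patch_t, patch_h, patch_w):
    """Split a spatio-temporal field of shape (T, H, W) into flat patch tokens.

    Illustrates (only in spirit) discretizing a continuous spatio-temporal
    field into learnable computational units.
    """
    T, H, W = field.shape
    assert T % patch_t == 0 and H % patch_h == 0 and W % patch_w == 0
    tokens = (
        field.reshape(T // patch_t, patch_t,
                      H // patch_h, patch_h,
                      W // patch_w, patch_w)
             .transpose(0, 2, 4, 1, 3, 5)          # group patch dims together
             .reshape(-1, patch_t * patch_h * patch_w)
    )
    return tokens  # shape: (num_units, unit_dim)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the units.

    No learned Q/K/V projections here; this is a sketch of the
    low-inductive-bias mixing step a minimalist architecture relies on.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ x

rng = np.random.default_rng(0)
field = rng.random((8, 16, 16))            # e.g. 8 time steps of a 16x16 traffic grid
units = minist_tokenize(field, 2, 4, 4)    # -> 64 units of dimension 32
mixed = self_attention(units)
print(units.shape, mixed.shape)            # (64, 32) (64, 32)
```

Because the same tokenizer could be applied to gridded rasters and (after interpolation onto a grid) sensor readings, one attention stack can in principle consume both observation types, which is the unification the MiniST unit is described as providing.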
Problem

Research questions and friction points this paper is trying to address.

urban computing
spatio-temporal foundation models
generalizability
zero-shot generalization
foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model
spatio-temporal scaling
heterogeneity
MiniST unit
zero-shot generalization