🤖 AI Summary
Current spatio-temporal foundation models (STFMs) lag well behind vision and language foundation models in critical domains such as transportation, public health, and environmental science. Method: This paper introduces a unified vision framework for STFMs, formally characterizing their defining traits: dynamic geographic modeling, event-driven representation learning, long-horizon temporal dependency capture, and adaptive integration of heterogeneous multi-source data. To address key limitations in transferability, cross-domain generalization, and causal inference, the authors systematically design a technical pathway that integrates spatio-temporal modeling, multi-scale representation learning, and causal reasoning, accompanied by a phased development roadmap. Contribution/Results: The framework delivers actionable methodological support for high-impact applications, including traffic optimization, epidemic forecasting, and climate modeling, and advances spatio-temporal AI from task-specific solutions toward foundational, general-purpose capabilities.
📝 Abstract
Foundation models have revolutionized artificial intelligence, setting new performance benchmarks and enabling transformative capabilities across a wide range of vision and language tasks. However, despite the prevalence of spatio-temporal data in critical domains such as transportation, public health, and environmental monitoring, spatio-temporal foundation models (STFMs) have not yet achieved comparable success. In this paper, we articulate a vision for the future of STFMs, outlining their essential characteristics and the generalization capabilities necessary for broad applicability. We critically assess the current state of research, identify gaps relative to these ideal traits, and highlight key challenges that impede progress. Finally, we explore opportunities and directions for advancing research toward effective and broadly applicable STFMs.