🤖 AI Summary
Traditional spatiotemporal models suffer from task-specific design and poor cross-domain generalization. To address this, we propose UniSTD, a unified spatiotemporal modeling framework. Methodologically, UniSTD introduces (1) a vision–language joint pretraining foundation for learning general-purpose representations; (2) a rank-adaptive Mixture-of-Experts (MoE) mechanism, optimized in continuous space via fractional interpolation, together with an explicit temporal module, enabling unified modeling across disciplines and tasks; and (3) a two-stage pretraining-then-adaptation paradigm that balances broad generalizability with task-specific adaptability. Evaluated on a comprehensive benchmark comprising 10 spatiotemporal tasks spanning transportation, meteorology, remote sensing, and healthcare, UniSTD achieves, for the first time, concurrent inference across all tasks with a single model, eliminating the need for per-domain model training and substantially reducing the cost of multi-domain joint training. This work establishes a new paradigm for general-purpose spatiotemporal intelligence.
📝 Abstract
Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce **UniSTD**, a unified Transformer-based framework for spatiotemporal modeling, inspired by recent advances in foundation models with the two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve learning capabilities across domains, our framework employs a rank-adaptive mixture-of-experts adaptation that uses fractional interpolation to relax the discrete rank variables so that they can be optimized in continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.
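To make the fractional-interpolation idea concrete: a discrete rank choice for a low-rank adapter can be relaxed into a continuous variable by letting whole ranks contribute fully and scaling the next rank by the fractional part. The NumPy sketch below is our own minimal illustration of such a relaxation, not the paper's actual implementation; the function name and factor shapes are assumptions for exposition.

```python
import numpy as np

def fractional_rank_weight(A, B, r):
    """Low-rank update A[:, :r] @ B[:r, :] generalized to continuous rank r.

    Whole ranks contribute fully; the next rank is scaled by the
    fractional part of r, so the update interpolates smoothly between
    the discrete choices floor(r) and ceil(r), making r differentiable.
    (Illustrative sketch only, not the UniSTD implementation.)
    """
    k = int(np.floor(r))
    frac = r - k
    W = A[:, :k] @ B[:k, :]  # full contribution of the first k ranks
    if frac > 0 and k < A.shape[1]:
        W = W + frac * np.outer(A[:, k], B[k, :])  # partial next rank
    return W

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))  # d x R_max left factor
B = rng.standard_normal((4, 8))  # R_max x d right factor

W2 = fractional_rank_weight(A, B, 2.0)
W3 = fractional_rank_weight(A, B, 3.0)
W25 = fractional_rank_weight(A, B, 2.5)
# r = 2.5 lies exactly halfway between the rank-2 and rank-3 updates
assert np.allclose(W25, 0.5 * (W2 + W3))
```

Because the update varies linearly in the fractional part, gradients with respect to `r` are well defined, which is what allows a rank choice per expert to be optimized jointly with the network weights rather than searched over discretely.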