🤖 AI Summary
Urban dynamics forecasting faces three obstacles: fusing heterogeneous spatiotemporal data, the weak spatiotemporal dependency modeling of large language models (LLMs), and poor generalization caused by train-test distribution shift. To address them, this paper proposes UrbanMind, an LLM-based framework tailored for spatiotemporal prediction. Its core contributions are: (1) Muffin-MAE, a multifaceted fusion masked autoencoder enabling multiscale spatiotemporal representation learning; (2) a semantic-aware prompting and fine-tuning strategy that improves alignment between linguistic prompts and spatiotemporal semantics; and (3) a test-time adaptation mechanism that reconstructs LLM-generated embeddings to mitigate distribution shift. Evaluated on multiple real-world urban datasets, the framework achieves a 23.6% reduction in zero-shot prediction error and improves cross-domain transfer robustness by 41.2%, consistently outperforming state-of-the-art methods.
📝 Abstract
Understanding and predicting urban dynamics is crucial for managing transportation systems, optimizing urban planning, and enhancing public services. While neural network-based approaches have achieved success, they often rely on task-specific architectures and large volumes of data, limiting their ability to generalize across diverse urban scenarios. Meanwhile, Large Language Models (LLMs) offer strong reasoning and generalization capabilities, yet their application to spatial-temporal urban dynamics remains underexplored. Existing LLM-based methods struggle to effectively integrate multifaceted spatial-temporal data and fail to address distributional shifts between training and testing data, limiting their predictive reliability in real-world applications. To bridge this gap, we propose UrbanMind, a novel spatial-temporal LLM framework for multifaceted urban dynamics prediction that ensures both accurate forecasting and robust generalization. At its core, UrbanMind introduces Muffin-MAE, a multifaceted fusion masked autoencoder with specialized masking strategies that capture intricate spatial-temporal dependencies and intercorrelations among multifaceted urban dynamics. Additionally, we design a semantic-aware prompting and fine-tuning strategy that encodes spatial-temporal contextual details into prompts, enhancing LLMs' ability to reason over spatial-temporal patterns. To further improve generalization, we introduce a test-time adaptation mechanism with a test data reconstructor, enabling UrbanMind to dynamically adjust to unseen test data by reconstructing LLM-generated embeddings. Extensive experiments on real-world urban datasets across multiple cities demonstrate that UrbanMind consistently outperforms state-of-the-art baselines, achieving high accuracy and robust generalization, even in zero-shot settings.
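The abstract does not spell out which masking strategies Muffin-MAE uses, so the following is only a rough sketch of what masking a multifaceted spatial-temporal tensor could look like. Everything here is an assumption made for illustration: the `(facet, region, time)` layout, the three strategies (random, temporal, facet-wise), and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np

# Assumed layout: x[facet, region, time], e.g. facets = {taxi demand, bike flow, ...}.
# A mask is a boolean array of the same shape; True marks a hidden cell.

def random_mask(shape, ratio, rng):
    """Hide each cell independently with probability `ratio`."""
    return rng.random(shape) < ratio

def temporal_mask(shape, n_hidden_steps):
    """Hide the last `n_hidden_steps` time steps for every facet and region."""
    mask = np.zeros(shape, dtype=bool)
    mask[:, :, shape[2] - n_hidden_steps:] = True
    return mask

def facet_mask(shape, facet_index):
    """Hide one whole facet so it must be inferred from the others."""
    mask = np.zeros(shape, dtype=bool)
    mask[facet_index] = True
    return mask

def apply_mask(x, mask, fill_value=0.0):
    """Return a copy of `x` with hidden cells replaced by `fill_value`."""
    return np.where(mask, fill_value, x)

def masked_mse(x, x_hat, mask):
    """Reconstruction error measured on the hidden cells only."""
    return float(np.mean((x[mask] - x_hat[mask]) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((3, 4, 12))           # 3 facets, 4 regions, 12 time steps
    mask = temporal_mask(x.shape, 3)     # hide the 3 most recent steps
    x_in = apply_mask(x, mask)           # what an encoder would see
    x_hat = np.full_like(x, x.mean())    # stand-in for a decoder's output
    print(mask.sum(), masked_mse(x, x_hat, mask))
```

In a sketch like this, a temporal mask would push a model toward forecasting, while a facet-wise mask would push it to exploit correlations between facets; whether UrbanMind combines its strategies this way is not stated in the abstract.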