TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series

📅 2025-05-13
🤖 AI Summary
Existing spatiotemporal foundation models struggle to capture the multiscale spatiotemporal dynamics of land cover in satellite image time series (SITS), which limits downstream task performance. To address this, we propose TiMo, a hierarchical vision transformer foundation model for SITS built around a spatiotemporal gyroscope attention mechanism that captures evolving multiscale patterns across both space and time. For pre-training, we curate MillionST, a dataset of one million images from 100,000 geographic locations, each observed at 10 temporal phases over five years, and adapt masked image modeling to it. TiMo outperforms state-of-the-art methods on four tasks: deforestation monitoring, land cover segmentation, crop type classification, and flood detection. Code, model, and dataset will be released at https://github.com/MiliLab/TiMo.

📝 Abstract
Satellite image time series (SITS) provide continuous observations of the Earth's surface, making them essential for applications such as environmental management and disaster assessment. However, existing spatiotemporal foundation models rely on plain vision transformers, which encode entire temporal sequences without explicitly capturing multiscale spatiotemporal relationships between land objects. This limitation hinders their effectiveness in downstream tasks. To overcome this challenge, we propose TiMo, a novel hierarchical vision transformer foundation model tailored for SITS analysis. At its core, we introduce a spatiotemporal gyroscope attention mechanism that dynamically captures evolving multiscale patterns across both time and space. For pre-training, we curate MillionST, a large-scale dataset of one million images from 100,000 geographic locations, each captured across 10 temporal phases over five years, encompassing diverse geospatial changes and seasonal variations. Leveraging this dataset, we adapt masked image modeling to pre-train TiMo, enabling it to effectively learn and encode generalizable spatiotemporal representations. Extensive experiments across multiple spatiotemporal tasks, including deforestation monitoring, land cover segmentation, crop type classification, and flood detection, demonstrate TiMo's superiority over state-of-the-art methods. Code, model, and dataset will be released at https://github.com/MiliLab/TiMo.
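As a rough illustration of what masked image modeling over a time series could involve, the sketch below draws a random mask over the patch tokens of one SITS sample and checks the dataset arithmetic stated in the abstract. It is not TiMo's actual masking scheme, which this summary does not describe. The function name `random_token_mask`, the 8x8 patch grid, and the 0.75 mask ratio are all assumptions chosen for illustration.

```python
import random


def random_token_mask(num_phases, grid_h, grid_w, mask_ratio=0.75, seed=None):
    """Return mask[t][i][j] == True for patch tokens hidden from the encoder.

    Hypothetical helper: tokens are masked uniformly at random across all
    temporal phases and spatial positions. The mask ratio is a placeholder,
    not a value taken from the paper.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    total = num_phases * grid_h * grid_w
    num_masked = int(total * mask_ratio)
    # Sample flat token indices without replacement, then map back to (t, i, j).
    masked_ids = set(rng.sample(range(total), num_masked))
    return [
        [
            [((t * grid_h + i) * grid_w + j) in masked_ids for j in range(grid_w)]
            for i in range(grid_h)
        ]
        for t in range(num_phases)
    ]


# Dataset arithmetic as stated in the abstract:
# 100,000 locations, each with 10 temporal phases.
NUM_LOCATIONS = 100_000
PHASES_PER_LOCATION = 10
TOTAL_IMAGES = NUM_LOCATIONS * PHASES_PER_LOCATION

# One sample: 10 phases, each split into an assumed 8x8 grid of patches.
mask = random_token_mask(PHASES_PER_LOCATION, grid_h=8, grid_w=8, seed=0)
masked_count = sum(cell for phase in mask for row in phase for cell in row)
```

During pre-training, a model would be asked to reconstruct the masked tokens from the visible ones. Sampling the mask jointly over time and space, as done here, is one simple design choice among several; per-phase or tube-style masking are alternatives.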
Problem

Research questions and friction points this paper is trying to address.

Existing spatiotemporal foundation models, built on plain vision transformers, do not explicitly capture multiscale spatiotemporal relationships between land objects
Need for a specialized model to analyze satellite image time series effectively
Lack of large-scale datasets for pre-training spatiotemporal foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical vision transformer for SITS analysis
Spatiotemporal gyroscope attention mechanism
MillionST dataset for pre-training