🤖 AI Summary
Remote sensing time-series analysis is fragmented across task-specific models and lacks a unified spatiotemporal feature representation. To address this, we propose UniTS, a general-purpose generative framework that supports time-series reconstruction, cloud removal, semantic change detection, and forecasting. Built on the flow-matching paradigm, the architecture combines a diffusion transformer with two new components: an Adaptive Condition Injector (ACor), which strengthens conditioning on multimodal inputs, and a Spatiotemporal-aware Modulator (STM), which helps the spatiotemporal blocks capture complex spatiotemporal dependencies. Extensive experiments show that UniTS significantly outperforms existing methods, particularly under severe cloud contamination, missing modalities, and phenological variation in forecasting. We also release two high-quality multimodal remote sensing time-series datasets, TS-S12 and TS-S12CR, which fill the gap in benchmarks for time-series cloud removal and forecasting.
📝 Abstract
One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth's environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to each task and lack unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatiotemporal blocks, in which we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of the spatiotemporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap in benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
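The "deterministic evolution path from noise to targets" can be illustrated with a minimal, generic flow matching sketch. This is not the UniTS implementation. It assumes a straight-line path between noise and target (one common choice, which the abstract does not specify), and the names `velocity_fn`, `oracle_velocity`, and the `(batch, time, channels, height, width)` tensor layout are hypothetical. In a real model `velocity_fn` would be the conditioned network; here a closed-form toy stand-in is used so the example runs on its own.

```python
import numpy as np

def _bcast(t, ndim):
    """Reshape per-sample times (batch,) so they broadcast over the remaining axes."""
    return t.reshape(-1, *([1] * (ndim - 1)))

def interpolate(x0, x1, t):
    """Point on the assumed straight path from noise x0 (t=0) to target x1 (t=1)."""
    tb = _bcast(t, x0.ndim)
    return (1.0 - tb) * x0 + tb * x1

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between the predicted velocity and the path velocity x1 - x0."""
    x0 = rng.standard_normal(x1.shape)      # noise endpoint
    t = rng.uniform(size=x1.shape[0])       # one time in [0, 1) per sample
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t, cond)
    return float(np.mean((pred - (x1 - x0)) ** 2))

def sample(velocity_fn, cond, shape, rng, steps=50):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed-step Euler."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x

def oracle_velocity(x, t, cond):
    """Toy stand-in for a trained network (assumption): the target equals `cond`,
    so the exact velocity along the straight path is (cond - x) / (1 - t)."""
    return (cond - x) / (1.0 - _bcast(t, x.ndim))

rng = np.random.default_rng(0)
cond = rng.standard_normal((2, 3, 4, 8, 8))  # hypothetical (batch, time, channels, H, W)
generated = sample(oracle_velocity, cond, cond.shape, rng)
loss = flow_matching_loss(oracle_velocity, cond, cond, rng)
```

With the oracle, sampling starts from pure noise and ends at the conditioning tensor, and the training loss is numerically zero. A learned `velocity_fn` would approximate this behavior for targets that are not available at inference time, such as cloud-free or future frames.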