🤖 AI Summary
This work addresses a common disconnect in existing remote sensing approaches: change understanding and future prediction are treated as isolated tasks, which hinders cross-task knowledge transfer. To bridge this gap, the authors propose RS-WorldModel, the first unified world model that jointly handles spatiotemporal change comprehension and text-guided future scene generation in remote sensing. They also introduce RSWBench-1.1M, a large-scale dataset of 1.1 million language-annotated samples covering both tasks. Through a three-stage training paradigm of Geo-Aware Generative Pre-training (GAGP), Synergistic Instruction Tuning (SIT), and Verifiable Reinforcement Optimization (VRO), the model achieves state-of-the-art performance with only 2 billion parameters, outperforming open-source models up to 120 times larger on most spatiotemporal change question-answering benchmarks. It also attains a Fréchet Inception Distance (FID) of 43.13 on future scene forecasting, surpassing all existing open-source models and even closed-source counterparts such as Gemini-2.5-Flash Image.
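The summary does not say how VRO's rewards are implemented; the paper describes them only as verifiable and task-specific. Below is a minimal, self-contained sketch of what such a setup could look like for the change-QA side, assuming an exact-match reward and GRPO-style group-normalized advantages (a common choice in verifiable-reward RL, not confirmed by the paper); all names here are hypothetical illustrations, not the authors' API.

```python
import numpy as np

def qa_reward(prediction: str, answer: str) -> float:
    """Verifiable reward for change-QA: programmatic exact match after
    case/whitespace normalization returns 1.0, anything else 0.0."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if norm(prediction) == norm(answer) else 0.0

def group_advantages(rewards: list) -> np.ndarray:
    """Normalize rewards across a group of sampled responses to the same
    prompt (a GRPO-style baseline); each advantage would weight that
    response's log-likelihood in the policy update."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled answers to one change-QA prompt, scored against the truth:
samples = ["new buildings", "roads", "New Buildings ", "bare soil"]
rewards = [qa_reward(s, "new buildings") for s in samples]
print(rewards)                    # [1.0, 0.0, 1.0, 0.0]
print(group_advantages(rewards))  # positive for correct answers
```

Rewards of this kind are "verifiable" because they can be checked programmatically against ground truth, with no learned reward model in the loop.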
📝 Abstract
Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1-million-sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) Synergistic Instruction Tuning (SIT) jointly trains understanding and forecasting; (3) Verifiable Reinforcement Optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120$\times$ larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
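For reference, the reported FID compares the Gaussian statistics of deep image features from real and generated scenes: $\mathrm{FID} = \lVert\mu_r-\mu_g\rVert^2 + \mathrm{Tr}\!\left(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}\right)$. Here is a minimal sketch of that standard computation (not the authors' evaluation code, which would extract Inception pool3 features from the forecasted images first):

```python
import numpy as np
from scipy import linalg

def fid_from_features(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two feature sets: squared distance between the means
    plus a covariance term via the matrix square root of Sigma_r @ Sigma_g."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # sqrtm can return a complex matrix due to numerical error; keep the
    # real part, as standard FID implementations do.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy check on random stand-ins for Inception features (real usage would
# feed 2048-d activations from both image sets):
rng = np.random.default_rng(0)
real = rng.normal(size=(512, 64))
fake = rng.normal(loc=0.2, size=(512, 64))
print(f"FID = {fid_from_features(real, fake):.2f}")
```

Lower is better, so 43.13 means the model's forecasted scenes sit closer in feature space to real imagery than those of the compared baselines.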