🤖 AI Summary
Existing world models are validated predominantly in synthetic or constrained environments, offering limited spatial coverage and weak support for the complex semantic reasoning that real-world remote sensing demands. This work introduces RSWISE, the first world modeling framework designed specifically for remote sensing, targeting direction-conditioned spatial extrapolation: generating semantically consistent adjacent image tiles given a central image and a directional instruction. Methodologically, we propose the multimodal RemoteBAGEL model, which integrates a direction-conditioned generation mechanism with remote sensing–specific pretraining strategies. We further construct the RSWISE benchmark and a GPT-4o–based semantic evaluation protocol. Experiments across four real-world remote sensing domains demonstrate substantial improvements over state-of-the-art methods in visual fidelity, spatial consistency, and instruction adherence. RSWISE establishes a verifiable foundation for spatial reasoning, enabling practical applications such as disaster response and urban planning.
📝 Abstract
World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observation. However, existing approaches are evaluated predominantly in synthetic environments or constrained scene settings, limiting their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications such as disaster response and urban planning urgently require spatial reasoning capabilities. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and a directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT-4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. We then present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.
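To make the task formulation concrete, the following is a minimal sketch of what one direction-conditioned extrapolation record and a combined score might look like. All field names, the scenario labels, and the equal weighting of fidelity and compliance are illustrative assumptions; the paper's actual RSWISE schema and scoring protocol may differ.

```python
from dataclasses import dataclass

@dataclass
class ExtrapolationTask:
    """One RSWISE-style task (hypothetical schema, not the official one)."""
    center_tile: str      # path to the central observation
    direction: str        # directional instruction, e.g. "east"
    reference_tile: str   # ground-truth adjacent tile for evaluation
    scenario: str         # one of: general, flood, urban, rural

def combined_score(fidelity: float, compliance: float, w: float = 0.5) -> float:
    """Blend a visual-fidelity metric with a 0-1 instruction-compliance
    score (e.g. from a GPT-4o semantic judge). The linear weighting is
    an assumption for illustration only."""
    return w * fidelity + (1 - w) * compliance

# Example task and score under the assumed schema and weighting.
task = ExtrapolationTask("tiles/center_0042.png", "east",
                         "tiles/east_0042.png", "flood")
score = combined_score(fidelity=0.8, compliance=0.6)
```

The point of separating the two components, as the abstract emphasizes, is that a model could score well on pixel-level fidelity by near-replicating the central tile; the judge-based compliance term penalizes exactly that failure mode.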