🤖 AI Summary
Existing world models are validated predominantly in synthetic or constrained environments, offering limited spatial coverage and weak support for the complex semantic reasoning that real-world remote sensing demands. This work introduces RSWISE, the first world modeling framework designed specifically for remote sensing, targeting direction-conditioned spatial extrapolation: generating semantically consistent adjacent image tiles given a central image and a directional instruction. Methodologically, we propose the multimodal RemoteBAGEL model, which integrates a direction-conditioned generation mechanism with remote sensing–specific pretraining strategies. We further construct the RSWISE benchmark and a GPT-4o–based semantic evaluation protocol. Experiments across four real-world remote sensing domains demonstrate substantial improvements over state-of-the-art methods in visual fidelity, spatial consistency, and instruction adherence. RSWISE establishes a verifiable foundation for spatial reasoning, enabling practical applications such as disaster response and urban planning.
📝 Abstract
World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observation. However, existing approaches are evaluated predominantly in synthetic environments or constrained scene settings, limiting their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications such as disaster response and urban planning urgently require spatial reasoning capabilities. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and a directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT-4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. We then present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.
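To make the task formulation concrete, the following is a minimal sketch of what one direction-conditioned extrapolation record and a combined score might look like. All field names, the scenario labels, and the equal weighting of fidelity and compliance are illustrative assumptions; the paper's actual RSWISE schema and scoring protocol may differ.

```python
from dataclasses import dataclass

@dataclass
class ExtrapolationTask:
    """One RSWISE-style task (hypothetical schema, not the official one)."""
    center_tile: str      # path to the central observation
    direction: str        # directional instruction, e.g. "east"
    reference_tile: str   # ground-truth adjacent tile for evaluation
    scenario: str         # one of: general, flood, urban, rural

def combined_score(fidelity: float, compliance: float, w: float = 0.5) -> float:
    """Blend a visual-fidelity metric with a 0-1 instruction-compliance
    score (e.g. from a GPT-4o semantic judge). The linear weighting is
    an assumption for illustration only."""
    return w * fidelity + (1 - w) * compliance

# Example task and score under the assumed schema and weighting.
task = ExtrapolationTask("tiles/center_0042.png", "east",
                         "tiles/east_0042.png", "flood")
score = combined_score(fidelity=0.8, compliance=0.6)
```

The point of separating the two components, as the abstract emphasizes, is that a model could score well on pixel-level fidelity by near-replicating the central tile; the judge-based compliance term penalizes exactly that failure mode.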