Remote Sensing-Oriented World Model

📅 2025-09-22
🤖 AI Summary
Existing world models are validated mostly in synthetic or constrained environments, which leaves open how well they handle the broad spatial coverage and complex semantics of real-world remote sensing scenes. This work introduces what the authors describe as the first world modeling framework for remote sensing, formulated as direction-conditioned spatial extrapolation: given a central image tile and a directional instruction, the model generates a semantically consistent adjacent tile. For evaluation, the authors build RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark of 1,600 tasks across four scenarios (general, flood, urban, and rural) that pairs visual fidelity assessment with instruction-compliance scoring by GPT-4o as a semantic judge. They also present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for this task. In the reported experiments, RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE. The intended applications are those that depend on spatial reasoning, such as disaster response and urban planning.

📝 Abstract
World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. However, existing approaches are predominantly evaluated in synthetic environments or constrained scene settings, limiting their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications urgently require spatial reasoning capabilities for disaster response and urban planning. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT-4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. Afterwards, we present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.
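The task formulation above can be illustrated with a minimal sketch of how one (central tile, direction, target tile) example might be cut from a larger scene. This is not the authors' data pipeline: the function name `make_task`, the four-direction vocabulary, the non-overlapping tile layout, and the north-up image orientation are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical direction vocabulary: (row offset, column offset) in tile units.
# Assumes north-up imagery, so "north" means moving up in the array.
DIRECTIONS = {
    "north": (-1, 0),
    "south": (1, 0),
    "west": (0, -1),
    "east": (0, 1),
}


def make_task(scene, top, left, tile, direction):
    """Cut a (center, target) tile pair out of a larger scene.

    scene     : H x W x C array holding the full image
    top, left : upper-left corner of the central tile, in pixels
    tile      : side length of each square tile, in pixels
    direction : key of DIRECTIONS naming the neighbor to extrapolate

    Returns the central tile (model input) and the adjacent ground-truth
    tile (what the model is asked to generate). Raises ValueError if
    either tile would fall outside the scene.
    """
    d_row, d_col = DIRECTIONS[direction]
    t_top = top + d_row * tile
    t_left = left + d_col * tile
    height, width = scene.shape[:2]

    # Both tiles must lie fully inside the scene.
    for r, c in ((top, left), (t_top, t_left)):
        if r < 0 or c < 0 or r + tile > height or c + tile > width:
            raise ValueError("tile falls outside the scene")

    center = scene[top:top + tile, left:left + tile]
    target = scene[t_top:t_top + tile, t_left:t_left + tile]
    return center, target


if __name__ == "__main__":
    # Synthetic scene whose pixel values encode position, for a quick check.
    scene = np.arange(36).reshape(6, 6, 1)
    center, target = make_task(scene, top=2, left=2, tile=2, direction="east")
    print(center[..., 0])
    print(target[..., 0])
```

In this sketch a model would receive `center` plus the direction string and be scored on how well its output matches `target`. Any overlap between tiles, the exact wording of the instruction, and the tile size are details the abstract does not specify.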
Problem

Research questions and friction points this paper is trying to address.

Developing world models that support spatial reasoning in remote sensing applications
Creating a framework for direction-conditioned spatial extrapolation of image tiles
Establishing a benchmark to evaluate spatial reasoning in remote sensing
Innovation

Methods, ideas, or system contributions that make the work stand out.

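Direction-conditioned spatial extrapolation as the formulation of remote sensing world modeling
RSWISE benchmark with 1,600 tasks across four scenarios (general, flood, urban, rural)
RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation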
Authors

Yuxi Lu
University of Technology Sydney (UTS), Sydney, Australia

Biao Wu
University of Technology Sydney (UTS), Sydney, Australia

Zhidong Li
UTS
Machine Learning, Data Science

Kunqi Li
University of Technology Sydney (UTS), Sydney, Australia

Chenya Huang
University of Technology Sydney (UTS), Sydney, Australia

Huacan Wang
University of Chinese Academy of Sciences (UCAS), Beijing, China

Qizhen Lan
UTHealth Houston
Computer Vision, Knowledge Distillation, Object Detection, Medical Imaging, Statistical Modeling

Ronghao Chen
Peking University, Beijing, China

Ling Chen
University of Technology Sydney (UTS), Sydney, Australia

Bin Liang
University of Technology Sydney (UTS), Sydney, Australia