Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) lack explicit spatial reasoning capabilities, rely heavily on costly manual annotations, and struggle to integrate visual cues with geographic prior knowledge for geospatial reasoning. Method: We propose Geo-R1—the first reasoning-centric post-training framework for VLMs—featuring: (1) a novel "reasoning guidance → reasoning enhancement" two-stage paradigm, shifting geospatial modeling from pretraining/supervised fine-tuning to reasoning-first post-training; (2) verifiable and scalable reward signals derived from synthetic chain-of-thought samples and weakly supervised cross-view pairing proxies; and (3) joint optimization via supervised fine-tuning and GRPO-based reinforcement learning to align cross-modal features and refine reasoning. Results: Geo-R1 achieves state-of-the-art performance across multiple geospatial reasoning benchmarks, significantly improving generalization and reasoning capability in zero-annotation settings. The code and models are publicly released.

📝 Abstract
We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models by combining two stages: thinking scaffolding and elevating. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars, enabling models to connect visual cues with geographic priors without costly human reasoning annotations. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly supervised cross-view pairing proxy. This design supplies a verifiable and scalable reward signal: teaching models to capture and reconcile features across modalities, and harnessing reasoning for accurate prediction. Geo-R1 extends geospatial modeling from domain pretraining/supervised fine-tuning to reasoning-first post-training, and achieves state-of-the-art performance across various geospatial reasoning benchmarks. Our model is available at https://huggingface.co/miniHui/Geo-R1.
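The cross-view pairing proxy described above can be made concrete with a small sketch. This is not the authors' released code: the `<answer>` tag convention, the binary pairing reward, and the group-normalized advantage are illustrative assumptions about how a verifiable cross-view reward might plug into GRPO's critic-free, group-relative update.

```python
import re

# Hypothetical sketch of a verifiable cross-view pairing reward for GRPO.
# Assumption: each rollout must state, inside <answer>...</answer>, the index
# of the satellite view it pairs with the ground-level photo; the true pairing
# index comes "for free" from weak cross-view supervision (no human labels).

def pairing_reward(completion: str, true_index: int) -> float:
    """Return 1.0 when the rollout names the correct paired view, else 0.0."""
    m = re.search(r"<answer>\s*(\d+)\s*</answer>", completion)
    if m is None:
        return 0.0  # unverifiable output earns no reward
    return 1.0 if int(m.group(1)) == true_index else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward by the
    group's mean and standard deviation (the core of GRPO's critic-free RL)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        return [0.0] * len(rewards)  # no learning signal if all rollouts tie
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one ground-level query; the true pair is view 2.
group = [
    "The coastal road and cliff line suggest... <answer>2</answer>",
    "Rooftop layout points to... <answer>0</answer>",
    "Vegetation and road curvature match... <answer>2</answer>",
    "no tagged answer",
]
rewards = [pairing_reward(c, true_index=2) for c in group]
advantages = grpo_advantages(rewards)
```

Because the reward only checks an exactly verifiable pairing, it scales without annotation cost; rollouts whose reasoning lands on the correct view are pushed up relative to their group, which is the "elevating" signal the abstract describes.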
Problem

Research questions and friction points this paper is trying to address.

Unlocking geospatial reasoning in vision-language models through cross-view reinforcement learning
Teaching models to connect visual cues with geographic priors without human annotations
Enhancing cross-modal feature reconciliation for accurate geospatial predictions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-view reinforcement learning for geospatial reasoning
Synthetic chain-of-thought training without human annotations
Weakly-supervised cross-view pairing with verifiable reward signals
Chenhui Xu
PhD Student, University at Buffalo
Machine Learning, AI for Science
Fuxun Yu
Principal Research Manager, Microsoft
Artificial Intelligence, Performance Optimization, Interpretability
Michael J. Bianco
Microsoft
Jacob Kovarskiy
Microsoft
Raphael Tang
Microsoft
Machine Learning, Natural Language Processing, Multimodality, Information Retrieval
Qi Zhang
Microsoft
Zirui Xu
Microsoft
Will LeVine
Microsoft
AI, Machine Learning, Deep Learning
Brandon Dubbs
Microsoft
Heming Liao
Microsoft
Cassandra Burgess
Microsoft
Suvam Bag
Microsoft
Jay Patravali
Microsoft
Rupanjali Kukal
Microsoft
Mikael Figueroa
Microsoft
Rishi Madhok
Microsoft
Nikolaos Karianakis
Microsoft
Artificial Intelligence, Deep Learning, Computer Vision
Jinjun Xiong
University at Buffalo
AI, Systems, Energy, Design Automation