RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses remote sensing visual grounding, where the large spatial scale and semantic ambiguity of the imagery make language descriptions rely heavily on positional cues, which in turn demands strong spatial reasoning from multimodal large language models. The authors propose a position-aware, reasoning-guided post-training framework that combines Chain-of-Thought supervised fine-tuning (CoT-SFT) on synthetically generated reasoning data with Reinforcement Fine-Tuning (RFT). The RFT stage adds a distance-based positional reward and a spatial consistency guided optimization strategy that targets incoherent localization across rollouts. The authors report superior performance and generalization on RSVG benchmarks.

📝 Abstract

Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
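
The abstract does not give the exact form of the positional reward or of the spatial consistency measure, so the following is only an illustrative sketch of what "continuous and distance-aware" guidance and "spatial coherence across rollouts" could look like. Every name and formula here (box_center, positional_reward, spatial_consistency, normalizing by the image diagonal) is an assumption made for illustration, not the authors' definition.

```python
import math
from typing import Sequence, Tuple

# Boxes are (x1, y1, x2, y2) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def positional_reward(pred: Box, target: Box, image_size: Tuple[float, float]) -> float:
    """Hypothetical distance-aware reward in [0, 1].

    Unlike a thresholded IoU reward, it decreases smoothly as the
    predicted center moves away from the target center.
    """
    (px, py), (tx, ty) = box_center(pred), box_center(target)
    diagonal = math.hypot(*image_size)      # normalizer: image diagonal
    distance = math.hypot(px - tx, py - ty)
    return max(0.0, 1.0 - distance / diagonal)


def spatial_consistency(rollouts: Sequence[Box], image_size: Tuple[float, float]) -> float:
    """Hypothetical coherence score in [0, 1] for a group of rollouts.

    Computed as 1 minus the mean distance of the rollout centers from
    their centroid, normalized by the image diagonal.
    """
    centers = [box_center(b) for b in rollouts]
    cx = sum(x for x, _ in centers) / len(centers)
    cy = sum(y for _, y in centers) / len(centers)
    diagonal = math.hypot(*image_size)
    spread = sum(math.hypot(x - cx, y - cy) for x, y in centers) / len(centers)
    return max(0.0, 1.0 - spread / diagonal)


if __name__ == "__main__":
    size = (300.0, 400.0)
    target = (30.0, 40.0, 40.0, 50.0)
    near = (0.0, 0.0, 10.0, 10.0)
    far = (200.0, 300.0, 210.0, 310.0)
    print(positional_reward(near, target, size))
    print(positional_reward(far, target, size))
    print(spatial_consistency([near, near, far], size))
```

In a setup like this, a score such as spatial_consistency could be used to scale the policy update for a group of rollouts, so that spatially scattered predictions contribute less. How RSGround-R1 actually adjusts its updates is not specified in the abstract.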

Problem

Research questions and friction points this paper is trying to address.

Remote Sensing Visual Grounding

Spatial Reasoning

Multimodal Large Language Models

Positional Cues

Aerial Imagery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Reasoning

Visual Grounding

Reinforcement Fine-Tuning