🤖 AI Summary
This work addresses Remote Sensing Visual Grounding (RSVG), where the large spatial scale and semantic ambiguity of aerial scenes make language descriptions rely heavily on positional cues, which in turn demands strong spatial reasoning from multimodal large language models. The authors propose RSGround-R1, a reasoning-guided, position-aware post-training framework that combines Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) on synthetically generated reasoning data with Reinforcement Fine-Tuning (RFT). RFT is augmented with a continuous, distance-aware positional reward and a spatial consistency guided optimization scheme that adjusts policy updates according to the spatial coherence of rollouts. Experiments on RSVG benchmarks show improved localization accuracy and stability, along with strong generalization.
📝 Abstract
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To exploit this property, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, that progressively enhances spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) on synthetically generated RSVG reasoning data to establish explicit position awareness. We then apply Reinforcement Fine-Tuning (RFT), augmented by a newly designed positional reward that provides continuous, distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behavior across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on the rollouts' spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate the superior performance and generalization of our model.
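The abstract does not give the form of the positional reward, so the following is only a minimal sketch of what "continuous and distance-aware" guidance could look like, not the authors' formulation. It assumes boxes in `(x1, y1, x2, y2)` pixel format and a hypothetical reward defined as one minus the center distance normalized by the image diagonal; the function names `iou` and `positional_reward` are illustrative. The comparison with IoU shows why a distance term can help: two non-overlapping predictions both receive an IoU of zero, while the distance-based score still ranks the closer one higher.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def positional_reward(pred, gt, image_size):
    """Hypothetical distance-aware reward in [0, 1] (an assumption, not the paper's).

    Returns 1 when the box centers coincide and decreases linearly with the
    center distance, normalized by the image diagonal.
    """
    width, height = image_size
    pred_cx, pred_cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gt_cx, gt_cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    dist = math.hypot(pred_cx - gt_cx, pred_cy - gt_cy)
    diag = math.hypot(width, height)
    return 1.0 - min(dist / diag, 1.0)


if __name__ == "__main__":
    gt = (100, 100, 200, 200)
    near = (250, 100, 350, 200)   # misses the target, but close to it
    far = (700, 700, 800, 800)    # misses the target by a wide margin
    size = (1000, 1000)
    for name, box in (("near", near), ("far", far)):
        print(name, "IoU:", iou(box, gt),
              "positional reward:", round(positional_reward(box, gt, size), 3))
```

In this toy setting, an IoU-only reward cannot distinguish the two misses, whereas the distance-based term provides a graded signal that could steer a policy toward the target region during RFT.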