🤖 AI Summary
Existing remote sensing visual localization methods are largely confined to single-entity perception and matching, lacking explicit reasoning capabilities and the ability to model relationships among multiple entities. To address this limitation, this work proposes ME-RSRG, the first benchmark for multi-entity joint reasoning in remote sensing localization, and introduces an Entity-Aware Reasoning (EAR) framework grounded in vision-language foundation models. EAR pioneers a multi-step reasoning paradigm, integrating an entity-aware reward mechanism with Group Relative Policy Optimization (GRPO) to enable structured, reasoning-trajectory-driven joint localization. Experimental results on ME-RSRG demonstrate that EAR significantly improves localization accuracy and reasoning capability in complex scenes, validating its effectiveness while highlighting the inherent challenges of multi-entity collaborative modeling.
📝 Abstract
Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to the remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon vision-language foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at https://github.com/CV-ShuchangLyu/ME-RSRG.