🤗 AI Summary
This work addresses the challenge of insufficient visual evidence in ultra-high-resolution remote sensing imagery, where task-relevant regions are minuscule and sparse, hindering effective multimodal reasoning. To overcome this, the authors propose a two-stage training paradigm that prioritizes textual knowledge before visual refinement: first, a reasoning scaffold is cold-started using geoscientific text-based question answering; then, supervised fine-tuning on image-text pairs provides stable guidance for subsequent agentic reinforcement learning (Agentic RLVR). This approach demonstrates the pivotal role of purely textual geoscientific knowledge in driving high-fidelity visual reasoning, and further enhances reliability through knowledge-graph validation. Evaluated on XLRS-Bench, the method achieves a state-of-the-art Pass@1 score of 60.40%, significantly outperforming larger general-purpose models such as GPT-5.2 and Gemini 3.0 Pro.
📄 Abstract
Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model must localize tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms, comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on the UHR RS benchmark. Our controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concepts, mechanistic explanations, and decision rules necessary to guide visual evidence retrieval. Based on this, we propose a staged knowledge-injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL. This approach achieves a 60.40% Pass@1 on XLRS-Bench, significantly outperforming larger general-purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) and establishing a new state-of-the-art.
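For readers unfamiliar with the reported metric, the Pass@1 figure above can be read as the fraction of benchmark questions answered correctly in a single attempt. A minimal sketch (the function name and the toy per-question outcomes below are illustrative, not from the paper):

```python
def pass_at_1(results):
    """Pass@1 as a percentage: results is a list of booleans,
    one per benchmark question, True if the model's single
    (first and only) attempt was judged correct."""
    return 100.0 * sum(results) / len(results)

# Toy example with hypothetical per-question outcomes:
outcomes = [True, True, False, True, False]
print(pass_at_1(outcomes))  # 60.0
```

Under this reading, the reported 60.40% means roughly three in five XLRS-Bench questions are answered correctly on the first try.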