Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding

πŸ“… 2026-02-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of insufficient visual evidence in ultra-high-resolution remote sensing imagery, where task-relevant regions are minuscule and sparse, hindering effective multimodal reasoning. To overcome this, the authors propose a two-stage training paradigm that prioritizes textual knowledge before visual refinement: first, a reasoning scaffold is cold-started with geoscientific text-only question answering; then, supervised fine-tuning on image–text pairs provides stable guidance for subsequent agentic reinforcement learning with verifiable rewards (Agentic RLVR). This approach demonstrates the pivotal role of purely textual geoscientific knowledge in driving high-fidelity visual reasoning, with reliability further enhanced through knowledge-graph validation of the text QA data. Evaluated on XLRS-Bench, the method achieves a state-of-the-art Pass@1 of 60.40%, significantly outperforming larger general-purpose models such as GPT-5.2 and Gemini 3.0 Pro.

πŸ“ Abstract
Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model must localize tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms, comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on the UHR RS benchmark. Our controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concepts, mechanistic explanations, and decision rules necessary to guide visual evidence retrieval. Based on this, we propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image–text examples during SFT to stabilize and amplify subsequent tool-based RL. This approach achieves 60.40% Pass@1 on XLRS-Bench, significantly outperforming larger general-purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) and establishing a new state-of-the-art.
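The staged recipe in the abstract can be sketched as an ordered training schedule. A minimal illustration follows; all stage, corpus, and function names here are hypothetical placeholders, not the authors' code, and the actual training loop (model, optimizer, tool environment) is abstracted away.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    name: str       # hypothetical stage label
    corpus: str     # which data this stage consumes
    objective: str  # training objective applied in this stage

def staged_recipe() -> List[Stage]:
    """Return the text-before-vision training order described in the abstract."""
    return [
        # (1) Cold-start SFT on knowledge-graph-verified Earth-science text QA,
        #     instilling domain concepts and reasoning structures before images.
        Stage("cold_start_text_sft", "earth_science_text_qa", "sft"),
        # (2) "Pre-warm" SFT on the same hard UHR image-text examples that the
        #     RL phase will later see, providing stable guidance for tool use.
        Stage("uhr_image_text_sft", "hard_uhr_image_text", "sft"),
        # (3) Agentic RLVR with zoom-in tools and verifiable rewards on those
        #     same hard UHR examples.
        Stage("agentic_rlvr", "hard_uhr_image_text", "rlvr_with_tools"),
    ]

def run(stages: List[Stage], train_fn: Callable[[Stage], None]) -> None:
    """Execute the stages strictly in order; order is the recipe's key claim."""
    for stage in stages:
        train_fn(stage)
```

The point of the sketch is the ordering constraint: the text-only stage precedes both vision stages, and the pre-warming SFT reuses the RL stage's corpus.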
Problem

Research questions and friction points this paper is trying to address.

ultra-high-resolution remote sensing
visual evidence acquisition
multimodal reasoning
agentic reinforcement learning
domain priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Staged Knowledge Injection
Agentic RLVR
Ultra-High-Resolution Remote Sensing
Text-to-Vision Transfer
Earth-Science Reasoning
πŸ”Ž Similar Papers
No similar papers found.