🤖 AI Summary
This work addresses the interpretation of ultra-high-resolution (UHR) remote sensing imagery, where the relevant visual evidence is sparse and tiny. Existing remote sensing vision-language models typically explore a scene along a single path, which can lose global context and lead to missed regions or duplicate counts. The authors propose GeoVista, a planning-driven active perception framework that first formulates a global exploration plan and then verifies multiple candidate regions branch by branch. An explicit evidence state supports cross-region aggregation and de-duplication. Training starts from APEX-GRO, a cold-start supervised trajectory corpus that casts UHR tasks as Global-Region-Object interactive reasoning with a unified, scale-invariant spatial representation. An Observe-Plan-Track mechanism handles global observation, adaptive region inspection, and evidence tracking, and the model is aligned with GRPO-based policy optimization using step-wise rewards. On RSHR-Bench, XLRS-Bench, and LRS-VQA, GeoVista is reported to achieve state-of-the-art performance.
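To make the idea of an explicit evidence state more concrete, here is a minimal Python sketch of cross-region de-duplication. It is an illustration under stated assumptions, not the GeoVista implementation: the names `EvidenceState`, `to_global`, and `iou`, the use of image-relative boxes in [0, 1] as the scale-invariant representation, and the IoU threshold of 0.5 are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical representation: (x0, y0, x1, y1), each in [0, 1] relative to the full image.
Box = tuple[float, float, float, float]


def to_global(region: Box, local: Box) -> Box:
    """Map a box given relative to a region crop back to full-image coordinates."""
    rx0, ry0, rx1, ry1 = region
    w, h = rx1 - rx0, ry1 - ry0
    lx0, ly0, lx1, ly1 = local
    return (rx0 + lx0 * w, ry0 + ly0 * h, rx0 + lx1 * w, ry0 + ly1 * h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class EvidenceState:
    """Accumulates object evidence across inspected regions (assumed design)."""
    iou_threshold: float = 0.5  # assumed value, not taken from the paper
    objects: list[Box] = field(default_factory=list)

    def add(self, region: Box, local: Box) -> bool:
        """Record a detection; return False if it duplicates earlier evidence."""
        box = to_global(region, local)
        if any(iou(box, seen) >= self.iou_threshold for seen in self.objects):
            return False
        self.objects.append(box)
        return True
```

Because every detection is mapped to the same image-relative frame before comparison, an object seen in two overlapping crops at different zoom levels collapses to a single entry, and `len(state.objects)` gives a de-duplicated count.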
📝 Abstract
Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To address these limitations, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista
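As a rough illustration of how step-wise rewards could feed a GRPO-style update, the sketch below combines three reward components into one scalar per sampled trajectory and then normalizes within the group, as in the standard group-relative advantage. The function names, the weights (0.3, 0.3, 0.4), and the use of the population standard deviation are assumptions made for this example; the abstract does not specify how GeoVista weights or aggregates its rewards.

```python
from statistics import mean, pstdev


def total_reward(plan: float, localization: float, answer: float,
                 weights: tuple[float, float, float] = (0.3, 0.3, 0.4)) -> float:
    """Weighted sum of planning, localization, and answer rewards (weights are hypothetical)."""
    w_plan, w_loc, w_ans = weights
    return w_plan * plan + w_loc * localization + w_ans * answer


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each trajectory's reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for the same question (made-up component scores).
group = [
    total_reward(1.0, 1.0, 1.0),  # good plan, good localization, correct answer
    total_reward(1.0, 0.5, 0.0),  # good plan, partial localization, wrong answer
    total_reward(0.0, 0.5, 1.0),  # poor plan, but correct answer
    total_reward(0.0, 0.0, 0.0),  # everything wrong
]
advantages = group_relative_advantages(group)
```

Trajectories that score above their group's mean receive positive advantages and are reinforced, while those below are discouraged. No separate value network is needed, which is the main design appeal of group-relative normalization.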