GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the interpretation of ultra-high-resolution (UHR) remote sensing imagery, where answering a query requires finding sparse, tiny visual evidence scattered across a large scene. Existing remote sensing vision-language models typically explore along a single path (a one-shot focus or one sequential zoom trajectory), which can lose global context, leave regions unvisited, and count the same evidence more than once. The authors propose GeoVista, a planning-driven active perception framework that first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection while maintaining an explicit evidence state for cross-region aggregation and de-duplication. Training starts from APEX-GRO, a cold-start supervised trajectory corpus that recasts UHR tasks as Global-Region-Object interactive reasoning over a unified, scale-invariant spatial representation. An Observe-Plan-Track mechanism handles global observation, adaptive region inspection, and evidence tracking, and the model is aligned with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. On RSHR-Bench, XLRS-Bench, and LRS-VQA, the authors report state-of-the-art performance.
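To make the idea of an explicit evidence state over a scale-invariant representation concrete, here is a minimal sketch. It is not the authors' implementation. It assumes boxes are stored as coordinates normalized to [0, 1] relative to the full image, and that duplicates are detected with an IoU threshold. The names `to_global`, `iou`, and `EvidenceState`, and the 0.5 threshold, are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: a box is (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def to_global(region: Box, local: Box) -> Box:
    """Map a box expressed relative to a region crop into full-image coordinates."""
    rx0, ry0, rx1, ry1 = region
    w, h = rx1 - rx0, ry1 - ry0
    lx0, ly0, lx1, ly1 = local
    return (rx0 + lx0 * w, ry0 + ly0 * h, rx0 + lx1 * w, ry0 + ly1 * h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class EvidenceState:
    """Hypothetical evidence store shared by all exploration branches."""
    iou_threshold: float = 0.5  # illustrative value, not from the paper
    boxes: List[Box] = field(default_factory=list)

    def add(self, box: Box) -> bool:
        """Record a detection unless it overlaps one already kept."""
        if any(iou(box, kept) >= self.iou_threshold for kept in self.boxes):
            return False
        self.boxes.append(box)
        return True


# Two overlapping regions that both contain the same object.
region_a: Box = (0.0, 0.0, 0.5, 0.5)
region_b: Box = (0.25, 0.25, 0.75, 0.75)
state = EvidenceState()
first = state.add(to_global(region_a, (0.8, 0.8, 0.9, 0.9)))
second = state.add(to_global(region_b, (0.3, 0.3, 0.4, 0.4)))
count = len(state.boxes)
```

Because both detections map to roughly the same full-image box, the second `add` call is rejected and the object is counted once, however many branches observe it.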
📝 Abstract
Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista
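The GRPO-based alignment mentioned in the abstract can be illustrated with the group-relative advantage that GRPO is built on: several trajectories are sampled for the same query, and each trajectory's reward is normalized by the group's mean and standard deviation. The sketch below is an assumption-laden illustration, not the paper's training code. The weighted sum in `total_reward`, its weights, and the use of the population standard deviation are hypothetical choices, since the abstract does not specify how the planning, localization, and answer rewards are combined.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def total_reward(
    planning: float,
    localization: float,
    answer: float,
    weights: Tuple[float, float, float] = (0.25, 0.25, 0.5),  # hypothetical weights
) -> float:
    """Combine the three reward terms named in the abstract into one scalar."""
    wp, wl, wa = weights
    return wp * planning + wl * localization + wa * answer


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled trajectories for one query: (planning, localization, answer) scores.
samples = [(1.0, 1.0, 1.0), (1.0, 0.5, 0.0), (0.0, 0.0, 0.0), (0.5, 1.0, 1.0)]
rewards = [total_reward(*s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Trajectories that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged. When every sample earns the same reward, all advantages are zero and the group contributes no learning signal.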
Problem

Research questions and friction points this paper is trying to address.

ultra-high-resolution remote sensing
active perception
visual grounding
exploration strategy
evidence aggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

active perception
global-region-object reasoning
multi-branch exploration
evidence de-duplication
GRPO-based planning
Jiashun Zhu
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Ronghao Fu
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Jiasen Hu
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Nachuan Xing
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Xu Na
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Xiao Yang
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Zhiwen Lin
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Weipeng Zhang
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Lang Sun
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Zhiheng Xue
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Haoran Liu
Ph.D. Student, Department of Computer Science & Engineering, Texas A&M University
LLMs, Graph/Geometric Learning, AI for Science, Generative Models
Weijie Zhang
University of Kansas Medical Center
Inverse planning, particle therapy
Bo Yang
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education