GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of tool usage homogenization in existing scaling-aware multimodal large language models when applied to ultra-high-resolution remote sensing image question answering, which often impedes the acquisition of task-relevant evidence. To this end, the authors propose the GeoEyes framework, which first constructs UHR-CoZ—a cold-start instruction-tuning dataset encompassing diverse zooming strategies—and then introduces AdaZoom-GRPO, an adaptive reinforcement learning method driven by evidence gain and answer improvement rewards. This enables the model to develop visual exploration capabilities characterized by on-demand focusing and timely termination. The study presents the first systematic solution to tool homogenization in remote sensing image understanding and establishes an evidence-driven, staged training paradigm. Evaluated on XLRS-Bench, the proposed approach achieves 54.23% accuracy, significantly outperforming current state-of-the-art methods and validating the efficacy of the introduced mechanisms.

📝 Abstract
The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
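The abstract describes AdaZoom-GRPO as rewarding both evidence gain and answer improvement across zoom interactions. The sketch below is an illustrative, hypothetical composite reward in that spirit; the weights, signal definitions, and function name are assumptions for exposition, not the paper's actual formulation.

```python
def adazoom_reward(evidence_before: float,
                   evidence_after: float,
                   answer_correct_before: bool,
                   answer_correct_after: bool,
                   w_evidence: float = 0.5,
                   w_answer: float = 0.5) -> float:
    """Hypothetical composite reward for a single zoom action:
    a term for newly acquired task-relevant evidence, plus a term
    for whether the zoom flips the answer from wrong to right."""
    # Only positive evidence gain is rewarded, so redundant zooms earn nothing.
    evidence_gain = max(0.0, evidence_after - evidence_before)
    # +1 if the answer improved, -1 if it degraded, 0 if unchanged.
    answer_improvement = float(answer_correct_after) - float(answer_correct_before)
    return w_evidence * evidence_gain + w_answer * answer_improvement

# A zoom that raises evidence confidence and fixes the answer is rewarded;
# a redundant zoom is not, discouraging homogeneous, task-agnostic tool calls.
print(adazoom_reward(0.2, 0.7, False, True))  # → 0.75
print(adazoom_reward(0.7, 0.7, True, True))   # → 0.0
```

Under this kind of shaping, a policy has no incentive to keep issuing identical zoom calls, which matches the paper's stated goal of on-demand focusing with timely termination.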
Problem

Research questions and friction points this paper is trying to address.

Tool Usage Homogenization
Ultra-High-Resolution Remote Sensing
Visual Question Answering
Evidence Acquisition
Multimodal Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Demand Visual Focusing
Tool Usage Homogenization
Chain-of-Zoom
AdaZoom-GRPO
Evidence-Grounded Reasoning
👥 Authors
Fengxiang Wang
National University of Defense Technology
Computer Vision · Remote Sensing
Mingshuo Chen
Beijing University of Posts and Telecommunications, China
Yueying Li
National University of Defense Technology, China
Yajie Yang
University of the Chinese Academy of Sciences, China
Yifan Zhang
Chinese Academy of Sciences, China
Long Lan
National University of Defense Technology, China
Xue Yang
Shanghai Jiao Tong University, China
Hongda Sun
Renmin University of China
Natural Language Processing · Large Language Models · AI for Healthcare
Yulin Wang
Shanghai Jiao Tong University
Di Wang
School of Computer Science, Wuhan University
Remote Sensing · Deep Learning · Computer Vision · Hyperspectral Image Classification
Jun Song
Shenzhen University
Nanophotonics
Jing Zhang
Wuhan University, China
Bo Du
Department of Management, Griffith Business School
Sustainable Transport · Travel Behaviour · Urban Data Analytics · Logistics and Supply Chain