🤖 AI Summary
This work addresses tool-usage homogenization in existing zoom-enabled multimodal large language models applied to ultra-high-resolution remote sensing visual question answering, a failure mode that impedes the acquisition of task-relevant evidence. To this end, the authors propose the GeoEyes framework, which first constructs UHR-CoZ, a cold-start instruction-tuning dataset covering diverse zooming strategies, and then introduces AdaZoom-GRPO, an adaptive reinforcement learning method driven by evidence-gain and answer-improvement rewards. This enables the model to develop visual exploration behavior characterized by on-demand focusing and timely termination. The study presents the first systematic solution to tool homogenization in remote sensing image understanding and establishes an evidence-driven, staged training paradigm. Evaluated on XLRS-Bench, the approach achieves 54.23% accuracy, significantly outperforming current state-of-the-art methods and validating the efficacy of the proposed mechanisms.
📝 Abstract
The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
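To make the reward idea in AdaZoom-GRPO concrete, here is a minimal sketch of how an evidence-gain score and an answer-improvement bonus could be combined into a per-trajectory reward and turned into group-relative advantages in GRPO style. The weights `w_evidence`/`w_answer`, the binary answer-improvement signal, and all function names are illustrative assumptions, not the paper's actual reward specification.

```python
from statistics import mean, pstdev

def zoom_reward(evidence_gain: float, answer_improved: bool,
                w_evidence: float = 0.5, w_answer: float = 0.5) -> float:
    """Hypothetical per-trajectory reward: a weighted sum of an
    evidence-gain score (how much task-relevant content the zoom
    sequence revealed) and a binary answer-improvement bonus."""
    return w_evidence * evidence_gain + w_answer * (1.0 if answer_improved else 0.0)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages in the GRPO style: normalize each
    trajectory's reward by the group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a uniform group
    return [(r - mu) / sigma for r in rewards]

# Example: a group of four zoom trajectories sampled for the same question.
rewards = [zoom_reward(0.8, True), zoom_reward(0.2, False),
           zoom_reward(0.5, True), zoom_reward(0.1, False)]
advantages = grpo_advantages(rewards)
```

Under this sketch, trajectories that both uncover more evidence and flip the answer to correct receive positive advantages relative to the group, which is the pressure that would push tool calls away from task-agnostic, homogenized zoom patterns.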