ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation benchmarks for remote sensing agents lack realistic, application-oriented assessment of tool-use capabilities. Method: We introduce the first tool-augmented benchmark for remote sensing agents, covering seven real-world task categories, including urban planning and disaster assessment, that require multi-step tool invocation and spatial reasoning over satellite and aerial imagery. Our framework systematically evaluates tool-use proficiency through structured task design, human-in-the-loop query construction, and a two-dimensional evaluation metric (step-by-step execution plus final-answer correctness). Built upon the ReAct paradigm, it integrates remote sensing understanding, geospatial tool invocation, and multi-step planning. Results: Evaluated on 436 tasks, our benchmark reveals significant disparities among models (e.g., GPT-4o, Qwen2.5) in tool accuracy and planning consistency. All code and data are publicly released, establishing a foundational benchmark for embodied intelligence in remote sensing.

📝 Abstract
Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Each query is grounded in satellite or aerial imagery and requires agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open- and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 436 structured agentic tasks. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing. Our code and dataset are publicly available.
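The ReAct-style interaction loop mentioned in the abstract alternates model-generated Thought/Action steps with tool Observations until a final answer emerges. A minimal sketch of that loop is below; the `fake_llm` stand-in, the `count_objects` tool, and the transcript format are illustrative assumptions, not ThinkGeo's actual prompts or toolset.

```python
def fake_llm(prompt):
    # Stand-in for a real LLM call (assumption for this sketch):
    # first requests a tool, then answers once an observation is present.
    if "Observation" not in prompt:
        return "Thought: I need to count buildings.\nAction: count_objects[building]"
    return "Thought: I have the count.\nFinal Answer: 12"

# Hypothetical tool registry; real agents would wrap detection/segmentation models.
TOOLS = {
    "count_objects": lambda arg: f"Found 12 instances of '{arg}'.",
}

def react_loop(query, llm, max_steps=5):
    """Alternate Thought/Action/Observation until a Final Answer appears."""
    transcript = f"Question: {query}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[1].strip()
        # Parse "Action: tool[arg]" and append the tool's observation.
        action = reply.split("Action:")[1].strip()
        name, arg = action.split("[", 1)
        observation = TOOLS[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return None

print(react_loop("How many buildings are in this image?", fake_llm))  # → 12
```

The transcript grows with each step, so the model always sees the full Thought/Action/Observation history when deciding its next move, which is the core of the ReAct paradigm.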
Problem

Research questions and friction points this paper is trying to address.

Evaluating tool-augmented LLM agents in remote sensing tasks
Assessing domain-specific tool-use for spatial reasoning challenges
Benchmarking multi-step planning accuracy with diverse real-world applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-augmented agents for remote sensing tasks
ReAct-style interaction loop for multi-step planning
Diverse toolset for spatial reasoning in imagery
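The dual-dimensional evaluation (step-by-step execution plus final answer) can be sketched as a comparison of the agent's predicted tool-call trace against a gold trace, alongside an answer check. The matching rule and metric names below are illustrative assumptions, not ThinkGeo's exact definitions.

```python
def evaluate(pred_steps, gold_steps, pred_answer, gold_answer):
    """Return step-wise accuracy and final-answer correctness.

    Step-wise: fraction of gold tool calls matched position-by-position
    (a simplification; real benchmarks may allow reordering or argument
    tolerance). Final answer: exact string match after trimming.
    """
    matched = sum(p == g for p, g in zip(pred_steps, gold_steps))
    step_acc = matched / max(len(gold_steps), 1)
    return {"step_accuracy": step_acc,
            "answer_correct": pred_answer.strip() == gold_answer.strip()}

# Hypothetical example: correct detection step, wrong counting argument,
# yet the final answer still happens to be right.
gold = ["detect[building]", "count[detections]"]
pred = ["detect[building]", "count[cars]"]
print(evaluate(pred, gold, "12", "12"))
# → {'step_accuracy': 0.5, 'answer_correct': True}
```

Separating the two dimensions matters: an agent can reach a correct answer through an unsound tool sequence, which a final-answer metric alone would never reveal.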
Authors

Akashah Shabbir, Mohamed bin Zayed University of AI
Muhammad Akhtar Munir, Mohamed bin Zayed University of Artificial Intelligence, UAE
Akshay Dudhane, SPACE42; Ex Research Scientist, MBZUAI; PhD IIT Ropar
Muhammad Umer Sheikh, Mohamed bin Zayed University of AI
Muhammad Haris Khan, Mohamed bin Zayed University of AI
Paolo Fraccaro, IBM Research
Juan Bernabe Moreno, IBM Research
Fahad Shahbaz Khan, MBZUAI, Linköping University Sweden
Salman Khan, Mohamed bin Zayed University of AI, Australian National University