RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of mapping ambiguous natural-language user intents to multi-granularity Earth observation vision tasks. To this end, we propose RemoteAgent, a reinforcement learning-based agent framework that identifies the granularity a query requires, handles image-level and sparse-region tasks directly with a multimodal large language model, and invokes specialized tools only for dense prediction, thereby respecting the capability boundaries of each component. Our contributions include VagueEO, the first human-centric instruction dataset for Earth observation tailored to vague queries; a task routing mechanism; and an end-to-end reinforcement fine-tuning strategy for alignment. Experiments demonstrate that the approach improves robustness in intent recognition and prediction performance across diverse Earth observation tasks while maintaining computational efficiency.
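The routing idea in this summary can be illustrated with a minimal sketch. It assumes the task granularity has already been identified; in RemoteAgent that decision comes from the fine-tuned MLLM, which is not modeled here. The names `Granularity`, `Route`, and `route` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    """Hypothetical labels for the three task levels named in the summary."""
    IMAGE = "image"           # holistic, image-level interpretation
    SPARSE_REGION = "region"  # a small number of regions or boxes
    DENSE = "dense"           # pixel-wise prediction


@dataclass
class Route:
    handler: str  # "mllm" or "tool"
    reason: str


def route(granularity: Granularity) -> Route:
    """Keep image- and sparse-region tasks in the MLLM; delegate dense ones."""
    if granularity is Granularity.DENSE:
        return Route("tool", "dense prediction is ill-suited to text output")
    return Route("mllm", "task fits the MLLM's native capabilities")


if __name__ == "__main__":
    for g in Granularity:
        print(g.value, "->", route(g).handler)
```

The point of the sketch is the asymmetry: a tool is invoked only on the dense branch, so image-level and sparse-region queries avoid the cost of an external call.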
📝 Abstract
Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
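As an illustration of what orchestrating a tool via the Model Context Protocol could look like on the wire, the sketch below builds a JSON-RPC 2.0 `tools/call` request, which is the MCP method for invoking a tool. The tool name `segment_land_cover`, its arguments, and the helper `build_tool_call` are hypothetical; the abstract does not specify which tools RemoteAgent exposes or how they are parameterized.

```python
import json
from itertools import count

# Monotonically increasing JSON-RPC request ids.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical dense-prediction tool and arguments, for illustration only.
request = build_tool_call(
    "segment_land_cover",
    {"image_uri": "file:///tmp/scene.tif", "classes": ["water"]},
)
print(json.dumps(request, indent=2))
```

Only the request envelope is shown; transport, session initialization, and response handling are left out.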
Problem

Research questions and friction points this paper is trying to address.

Earth Observation
vague human intents
multi-granularity visual analysis
spatial predictions
intent recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic MLLMs
reinforcement learning fine-tuning
vague intent understanding
multi-granularity Earth Observation
Model Context Protocol
Authors

Liang Yao, Hohai University
Shengxiang Xu, Southeast University
Fan Liu, Hohai University (computer vision)
Chuanyi Zhang, Hohai University
Bishun Yao, Hohai University
Rui Min, Hong Kong University of Science and Technology (Machine Learning, Agent, Trustworthy AI)
Yongjun Li, Hohai University
Chaoqian Ouyang, Sun Yat-sen University
Shimin Di, Southeast University
Min-Ling Zhang, Professor, School of Computer Science and Engineering, Southeast University, China (Artificial Intelligence, Machine Learning, Data Mining)