RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

📅 2026-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of mapping ambiguous natural-language user intents to Earth observation vision tasks of differing granularity. To this end, we propose RemoteAgent, a reinforcement learning-based agent framework that dynamically identifies task granularity, lets a multimodal large language model handle image-level and sparse-region tasks itself, and invokes specialized tools for dense prediction only when necessary, thereby respecting the capability boundaries of each model component. Our contributions include VagueEO, the first human-centric instruction dataset for Earth observation tailored to vague queries; a task routing mechanism; and an end-to-end reinforcement fine-tuning strategy for alignment. Experiments show that the approach improves robustness in intent recognition and delivers competitive prediction performance across diverse Earth observation tasks while remaining computationally efficient.

📝 Abstract

Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
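The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Granularity`, `Decision`, `route`, and the tool name `segmentation_tool` are hypothetical, and the sketch assumes the task granularity has already been inferred from the vague query (in the paper this is the job of the fine-tuned MLLM). It only shows the dispatch rule: image-level and sparse-region tasks stay inside the MLLM, and dense predictions go to an external tool.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    """Hypothetical labels for the three task levels named in the abstract."""
    IMAGE = "image"           # holistic image interpretation
    SPARSE_REGION = "region"  # a small number of localized outputs
    DENSE = "dense"           # pixel-wise prediction


@dataclass
class Decision:
    granularity: Granularity
    handler: str  # "mllm" or "tool:<name>"


def route(granularity: Granularity, dense_tool: str = "segmentation_tool") -> Decision:
    """Dispatch rule only; `dense_tool` is an assumed placeholder tool name.

    Dense tasks are delegated to an external tool (the paper orchestrates
    such tools via the Model Context Protocol; no MCP call is made here).
    Everything else is answered by the MLLM directly.
    """
    if granularity is Granularity.DENSE:
        return Decision(granularity, f"tool:{dense_tool}")
    return Decision(granularity, "mllm")


if __name__ == "__main__":
    for g in Granularity:
        print(g.name, "->", route(g).handler)
```

Keeping the tool call behind a single branch reflects the abstract's point that indiscriminate tool invocation wastes computation: the external tool is reached only when the text-based output of the MLLM is ill-suited to the task.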

Problem

Research questions and friction points this paper is trying to address.

Earth Observation

vague human intents

multi-granularity visual analysis

spatial predictions

intent recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic MLLMs

reinforcement learning fine-tuning

vague intent understanding