🤖 AI Summary
To enable UAVs to interpret satellite imagery and natural-language instructions, autonomously plan missions, and adapt flight trajectories to changing environments, this paper proposes a scalable multi-agent planning framework. Methodologically, it integrates a large language model (LLM) with a vision-language model (Qwen2.5-VL-7B) under the ReAct paradigm for cross-modal reasoning, and introduces a novel pixel-level visual grounding mechanism and a reactive thinking loop that together support precise semantic target localization, online target refinement under environmental change, and multi-UAV coordination. Technical contributions include fine-grained spatial-alignment fine-tuning on satellite imagery, a pointing-based interaction paradigm, and an open-source vision-language UAV planning benchmark dataset. Experiments report an average task-generation latency of 96.96 seconds and a success rate of 93%, validated in industrial inspection and forest-fire detection scenarios.
📝 Abstract
We present UAV-CodeAgents, a scalable multi-agent framework for autonomous UAV mission generation, built on large language and vision-language models (LLMs/VLMs). The system leverages the ReAct (Reason + Act) paradigm to interpret satellite imagery, ground high-level natural language instructions, and collaboratively generate UAV trajectories with minimal human supervision. A core component is a vision-grounded, pixel-pointing mechanism that enables precise localization of semantic targets on aerial maps. To support real-time adaptability, we introduce a reactive thinking loop, allowing agents to iteratively reflect on observations, revise mission goals, and coordinate dynamically in evolving environments. UAV-CodeAgents is evaluated on large-scale mission scenarios involving industrial and environmental fire detection. Our results show that a lower decoding temperature (0.5) yields higher planning reliability and reduced execution time, with an average mission creation time of 96.96 seconds and a success rate of 93%. We further fine-tune Qwen2.5-VL-7B on 9,000 annotated satellite images, achieving strong spatial grounding across diverse visual categories. To foster reproducibility and future research, we will release the full codebase and a novel benchmark dataset for vision-language-based UAV planning.
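The reactive thinking loop described above alternates reasoning over the latest observation with acting on it, revising behavior as the environment evolves. The following is a minimal, hypothetical Python sketch of that ReAct-style loop; the class names, actions, and the keyword-based `reason` stub stand in for the paper's actual LLM/VLM calls and UAV controller, which are not specified here.

```python
from dataclasses import dataclass, field

@dataclass
class ReActAgent:
    """Toy ReAct-style agent: Reason -> Act -> Observe, repeated.

    `reason` is a stand-in for an LLM/VLM call; `act` is a stand-in
    for UAV execution. Both are illustrative, not the paper's API.
    """
    goal: str
    log: list = field(default_factory=list)

    def reason(self, observation: str) -> str:
        # Map the latest observation to an action (here: keyword match
        # instead of a real model call).
        if "fire" in observation:
            return f"point_pixel: localize '{observation}' on satellite map"
        return "scan: survey next grid cell"

    def act(self, action: str, world: list) -> str:
        # Execute the action and return the next observation.
        self.log.append(action)
        return world.pop(0) if world else "mission_complete"

def run_mission(agent: ReActAgent, world: list, max_steps: int = 10) -> list:
    observation = world.pop(0)
    for _ in range(max_steps):
        if observation == "mission_complete":
            break
        action = agent.reason(observation)      # Reason
        observation = agent.act(action, world)  # Act + observe
    return agent.log

log = run_mission(
    ReActAgent(goal="detect fires"),
    ["clear terrain", "smoke and fire near ridge"],
)
```

In this sketch, the agent first scans (the terrain is clear), then switches to a pixel-pointing action once the "fire" observation arrives, mirroring how the loop lets agents revise mission goals mid-flight rather than committing to a fixed plan.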