UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To enable UAVs to interpret satellite imagery and natural-language instructions, plan missions autonomously, and adapt flight trajectories in dynamic environments, this paper proposes a scalable multi-agent planning framework. Methodologically, it integrates a large language model (LLM) with a vision-language model (Qwen2.5-VL-7B) under the ReAct paradigm for cross-modal reasoning, and introduces a pixel-level visual grounding mechanism together with a reactive thinking loop to support precise semantic target localization, online target refinement under environmental change, and multi-UAV coordination. Technical contributions include fine-grained spatial-alignment fine-tuning on satellite imagery, a pointing-based interaction paradigm, and an open-source vision-language benchmark dataset for UAV planning. Experiments report an average mission generation latency of 96.96 seconds and a 93% success rate, validated in industrial inspection and forest fire detection scenarios.

📝 Abstract
We present UAV-CodeAgents, a scalable multi-agent framework for autonomous UAV mission generation, built on large language and vision-language models (LLMs/VLMs). The system leverages the ReAct (Reason + Act) paradigm to interpret satellite imagery, ground high-level natural language instructions, and collaboratively generate UAV trajectories with minimal human supervision. A core component is a vision-grounded, pixel-pointing mechanism that enables precise localization of semantic targets on aerial maps. To support real-time adaptability, we introduce a reactive thinking loop, allowing agents to iteratively reflect on observations, revise mission goals, and coordinate dynamically in evolving environments. UAV-CodeAgents is evaluated on large-scale mission scenarios involving industrial and environmental fire detection. Our results show that a lower decoding temperature (0.5) yields higher planning reliability and reduced execution time, with an average mission creation time of 96.96 seconds and a success rate of 93%. We further fine-tune Qwen2.5VL-7B on 9,000 annotated satellite images, achieving strong spatial grounding across diverse visual categories. To foster reproducibility and future research, we will release the full codebase and a novel benchmark dataset for vision-language-based UAV planning.
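The abstract's pixel-pointing mechanism maps a VLM-predicted pixel coordinate on a satellite tile to a geographic location. As an illustrative sketch only (the function name, tile bounds, and the linear-interpolation scheme are assumptions, not the paper's implementation), the conversion on a georeferenced tile could look like:

```python
# Hypothetical sketch: convert a VLM's pixel prediction on a georeferenced
# satellite tile to latitude/longitude by linear interpolation over the
# tile's corner coordinates. Names and scheme are illustrative assumptions.

def pixel_to_latlon(px, py, width, height, bounds):
    """Map pixel (px, py) on a width x height tile to (lat, lon).

    bounds = (lat_top, lon_left, lat_bottom, lon_right) of the tile.
    """
    lat_top, lon_left, lat_bottom, lon_right = bounds
    lon = lon_left + (px / (width - 1)) * (lon_right - lon_left)
    lat = lat_top + (py / (height - 1)) * (lat_bottom - lat_top)
    return lat, lon

# Example: on a 1024x1024 tile, the center pixel maps to the tile's center.
bounds = (37.80, -122.52, 37.70, -122.36)
lat, lon = pixel_to_latlon(511.5, 511.5, 1024, 1024, bounds)
# lat ≈ 37.75, lon ≈ -122.44
```

A real system would use the tile's geotransform (e.g. from its GeoTIFF metadata) rather than assuming an axis-aligned, linearly sampled tile.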
Problem

Research questions and friction points this paper is trying to address.

Autonomous UAV mission generation with minimal human supervision
Precise localization of semantic targets on aerial maps
Real-time adaptability in evolving environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework with LLMs/VLMs for UAV missions
Vision-grounded pixel-pointing for precise target localization
Reactive thinking loop for real-time mission adaptation
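The reactive thinking loop above alternates reasoning and acting, reflecting on each observation and revising the mission goal when the environment changes. A minimal toy sketch of that control flow (the environment, message format, and revision rule here are stand-in assumptions, not the paper's agents):

```python
# Toy sketch of a ReAct-style reactive thinking loop: reason about the
# current goal, act/observe, and revise the goal if the observation
# signals an environmental change. Illustrative only.

def react_loop(goal, environment, max_steps=10):
    trace = []
    for _ in range(max_steps):
        thought = f"plan action toward '{goal}'"   # Reason step
        observation = environment(goal)            # Act + observe
        trace.append((thought, observation))
        if observation == "done":                  # goal reached
            return goal, trace
        if observation.startswith("revise:"):      # reflect and revise goal
            goal = observation.split(":", 1)[1]
    return goal, trace

# Toy environment: the original target moves once, then is reached.
events = iter(["revise:new hotspot", "searching", "done"])
final_goal, trace = react_loop("inspect hotspot", lambda g: next(events))
# final_goal == "new hotspot"; the loop ran for 3 steps.
```

In the paper's setting the "environment" would be UAV sensing plus inter-agent messages, and the reason/revise steps would be LLM/VLM calls rather than string matching.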