UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low navigation accuracy and inefficient task planning for unmanned aerial vehicles (UAVs) operating under joint natural language instructions and satellite imagery, this paper proposes the first vision-language-action (VLA) system tailored for large-scale aerial mission generation. The system integrates high-resolution satellite imagery, vision-language models (VLMs), and large language models (LLMs) end to end, leveraging KNN-based spatial retrieval to align geographic and semantic information, and directly generates executable flight trajectories and action sequences from textual instructions. Experiments show a 22% difference in generated trajectory length and a mean Euclidean localization error of 34.22 meters on map-based target positioning, outperforming baseline methods. This work establishes the first unified modeling framework jointly incorporating satellite imagery, natural language, and actionable UAV control policies, thereby introducing a novel paradigm for remote sensing-driven autonomous aerial agents.

📝 Abstract
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with a Visual Language Model (VLM) and the capabilities of GPT, UAV-VLA enables users to generate general flight paths and action plans through simple text requests. The system leverages the rich contextual information provided by satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by the VLM and natural language processing by GPT provides the user with a path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the created trajectory and a mean error of 34.22 m, by Euclidean distance, in finding the objects of interest on the map with the K-Nearest Neighbors (KNN) approach.
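The KNN-based localization error above can be sketched as a simple 1-nearest-neighbor matching between predicted and ground-truth object coordinates. The function name and toy coordinates below are illustrative assumptions, not the authors' implementation:

```python
# Sketch of a KNN-style localization evaluation: each predicted object
# location is matched to its nearest ground-truth object by Euclidean
# distance, and the mean distance over all predictions is reported.
# Coordinates are assumed to be in a metric (x, y) frame in meters.
import math

def nearest_neighbor_error(predicted, ground_truth):
    """Mean Euclidean distance from each predicted point to its
    nearest ground-truth point (1-NN matching)."""
    errors = []
    for px, py in predicted:
        # Distance to the closest ground-truth object
        best = min(math.hypot(px - gx, py - gy) for gx, gy in ground_truth)
        errors.append(best)
    return sum(errors) / len(errors)

# Toy example: two predicted object locations vs. two true locations
pred = [(10.0, 12.0), (55.0, 40.0)]
truth = [(12.0, 12.0), (50.0, 43.0)]
print(nearest_neighbor_error(pred, truth))
```

A lower value indicates that the objects of interest identified on the map lie closer to their true positions; the paper reports 34.22 m for this metric.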
Problem

Research questions and friction points this paper is trying to address.

Drone Navigation
Task Efficiency
Imagery Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

UAV-VLA
Visual Language Models (VLM)
GPT capabilities