🤖 AI Summary
To address low navigation accuracy and inefficient task planning for unmanned aerial vehicles (UAVs) operating under joint natural-language instructions and satellite imagery, this paper proposes the first vision-language-action (VLA) system tailored for large-scale aerial task generation. The system integrates high-resolution satellite imagery, vision-language models (VLMs), and large language models (LLMs) end to end, and uses K-Nearest Neighbors (KNN) spatial matching to align geographic and semantic information. It generates executable flight trajectories and action sequences directly from textual instructions. Experiments show a 22% difference in generated trajectory length relative to reference paths and a mean Euclidean localization error of 34.22 meters for map-based target positioning, substantially outperforming baseline methods. This work establishes the first unified framework jointly modeling satellite imagery, natural language, and actionable UAV control policies, thereby introducing a novel paradigm for remote sensing-driven autonomous aerial agents.
📝 Abstract
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with a Visual Language Model (VLM) and the capabilities of GPT, UAV-VLA enables users to generate flight paths and action plans from simple text requests. The system leverages the rich contextual information in satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by the VLM and natural language processing by GPT provides the user with a path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the created trajectories and a mean error of 34.22 m in finding objects of interest on the map, measured by Euclidean distance in a K-Nearest Neighbors (KNN) matching approach.
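The KNN-based localization metric described above can be sketched as follows: each object of interest is matched to its nearest predicted detection (K = 1), and the mean Euclidean distance over the matched pairs is reported. This is an illustrative reconstruction, not the paper's implementation; the function name and the toy coordinates are made up for the example.

```python
import numpy as np

def mean_knn_localization_error(predicted, ground_truth):
    """For each ground-truth object, find the nearest predicted
    object (K=1 nearest neighbor) and return the mean Euclidean
    distance between matched pairs, in the units of the input
    coordinates (e.g., meters in a local map frame)."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    # Pairwise Euclidean distances, shape (n_truth, n_pred)
    dists = np.linalg.norm(
        ground_truth[:, None, :] - predicted[None, :, :], axis=-1
    )
    # Distance from each ground-truth object to its nearest prediction
    nearest = dists.min(axis=1)
    return nearest.mean()

# Toy example with invented coordinates (meters)
gt = [[0.0, 0.0], [100.0, 50.0]]
pred = [[3.0, 4.0], [100.0, 60.0]]
print(mean_knn_localization_error(pred, gt))  # (5 + 10) / 2 = 7.5
```

In practice the predicted and ground-truth points would be geographic coordinates of objects identified on the satellite map, projected into a common metric frame before computing distances.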