AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models

📅 2024-08-28
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
🤖 AI Summary
Existing embodied world model research focuses predominantly on indoor ground agents, leaving aerial embodied intelligence—particularly for unmanned aerial vehicles (UAVs)—largely unexplored. Method: We propose AeroVerse, the first aerospace-oriented embodied world model benchmark suite for UAVs, integrating 2D/3D vision-language modeling, egocentric image-text-pose alignment, instruction tuning, and simulation-based training. It comprises (1) AerialAgent-Ego10k, a real-world egocentric pre-training dataset, and CyberAgent-Ego500k, its synthetically aligned virtual counterpart; (2) instruction-tuning sets covering five core embodied tasks (scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision); and (3) SkyAgentEval, a GPT-4-driven, multidimensional, interpretable automated evaluation framework. Results: Comprehensive evaluation of over ten vision-language models (VLMs) reveals, for the first time, their capability boundaries and actionable optimization pathways for UAV-specific embodied tasks.

📝 Abstract
Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models primarily focus on ground-level intelligent agents in indoor scenarios, while research on UAV intelligent agents remains unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring urban drones from a first-person perspective. We also create a virtual image-text-pose alignment dataset, CyberAgent-Ego500k, to facilitate the pre-training of the aerospace embodied world model. For the first time, we clearly define 5 downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodied world model. Simultaneously, we develop SkyAgentEval, a set of GPT-4-based downstream task evaluation metrics, to comprehensively, flexibly, and objectively assess the results, revealing the potential and limitations of 2D/3D vision-language models in UAV-agent tasks. Furthermore, we integrate over 10 2D/3D vision-language models, 2 pre-training datasets, 5 fine-tuning datasets, more than 10 evaluation metrics, and a simulator into the benchmark suite, i.e., AeroVerse, which will be released to the community to promote exploration and development of aerospace embodied intelligence.
Problem

Research questions and friction points this paper is trying to address.

Developing aerospace embodied world models for UAV autonomous intelligence
Creating datasets and benchmarks for UAV-agent training and evaluation
Addressing the lack of research on UAV intelligent agents in aerospace
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created first-person-perspective drone datasets for pre-training
Defined five downstream tasks with corresponding instruction datasets
Integrated models, datasets, metrics, and a simulator into a unified benchmark suite
Fanglong Yao
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, and with the Key Laboratory of Target Cognition and Application Technology(TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Yuanchang Yue
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100190, China, and with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China, and with the Key Laboratory of Target Cognition and Application Technology(TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Youzhi Liu
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100190, China, and with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China, and with the Key Laboratory of Target Cognition and Application Technology(TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Xian Sun
Aerospace Information Research Institute, Chinese Academy of Sciences
Remote Sensing, Computer Vision and Pattern Recognition, Artificial Intelligence
Kun Fu
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100190, China, and with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China, and with the Key Laboratory of Target Cognition and Application Technology(TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China