AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding

📅 2025-04-13
🤖 AI Summary
Problem: Existing UAV systems execute tasks inefficiently in dynamic environments and rely heavily on human monitoring and decision-making. Method: This paper proposes AirVista-II, an end-to-end embodied UAV agent system aimed at real-world applications such as logistics delivery and disaster response. It combines task-driven agent scheduling, multimodal (vision-temporal-semantic) perception fusion, and scene-adaptive keyframe extraction into an embodied decision-making and reasoning framework that generalizes across scenarios in a zero-shot setting. Contribution/Results: Experiments show that the system achieves high-quality semantic understanding across diverse dynamic UAV scenarios without task-specific training, addressing the efficiency and adaptability limitations of conventional human–UAV collaboration.

📝 Abstract
Unmanned Aerial Vehicles (UAVs) are increasingly important in dynamic environments such as logistics transportation and disaster response. However, current tasks often rely on human operators to monitor aerial videos and make operational decisions. This mode of human-machine collaboration suffers from significant limitations in efficiency and adaptability. In this paper, we present AirVista-II -- an end-to-end agentic system for embodied UAVs, designed to enable general-purpose semantic understanding and reasoning in dynamic scenes. The system integrates agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored for various temporal scenarios, enabling the efficient capture of critical scene information. Experimental results demonstrate that the proposed system achieves high-quality semantic understanding across diverse UAV-based dynamic scenarios under a zero-shot setting.
Problem

Research questions and friction points this paper is trying to address.

Enabling UAVs to autonomously understand dynamic scenes semantically
Reducing human reliance for monitoring aerial videos and decisions
Improving efficiency and adaptability in human-UAV collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-based task identification and scheduling
Integration of multimodal perception mechanisms
Differentiated keyframe extraction strategies
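
The abstract does not spell out how the differentiated keyframe extraction works, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes frames are flattened grayscale pixel lists, that clips shorter than a hypothetical `short_clip_seconds` cutoff are sampled uniformly, and that longer clips keep a frame only when it differs enough from the last kept one. All function names, parameters, and default values here are illustrative assumptions.

```python
from typing import List, Sequence

# Hypothetical representation: a frame is a flattened list of grayscale values.
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def uniform_keyframes(n_frames: int, k: int) -> List[int]:
    """Pick up to k evenly spaced frame indices."""
    if n_frames <= k:
        return list(range(n_frames))
    step = n_frames / k
    return [int(i * step) for i in range(k)]


def change_keyframes(frames: Sequence[Frame], threshold: float) -> List[int]:
    """Keep a frame when it differs from the last kept frame by more than threshold."""
    if not frames:
        return []
    keep = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[keep[-1]]) > threshold:
            keep.append(i)
    return keep


def extract_keyframes(
    frames: Sequence[Frame],
    fps: float,
    short_clip_seconds: float = 60.0,  # assumed cutoff, not taken from the paper
    k: int = 8,                        # assumed sample count for short clips
    threshold: float = 10.0,           # assumed change threshold for long clips
) -> List[int]:
    """Choose a strategy from the clip's temporal extent and return keyframe indices."""
    if len(frames) <= 1:
        # A single image needs no extraction.
        return list(range(len(frames)))
    duration = len(frames) / fps
    if duration < short_clip_seconds:
        return uniform_keyframes(len(frames), k)
    return change_keyframes(frames, threshold)
```

Dispatching on clip duration keeps the cost low for short inputs while letting long recordings be summarized by content change rather than by a fixed sampling rate; a real system would likely replace the pixel difference with a more robust similarity measure.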
Authors

Fei Lin
Macau University of Science and Technology
Parallel Intelligence, Large Language Model, Embodied Agent, AI4Science
Yonglin Tian
Institute of Automation, Chinese Academy of Sciences
Parallel intelligence, Parallel unmanned systems, Intelligent vehicles, Autonomous driving
Tengchao Zhang
Department of Engineering Science, Faculty of Innovation Engineering, Macau University of Science and Technology, Macau 999078, China
Jun Huang
Department of Engineering Science, Faculty of Innovation Engineering, Macau University of Science and Technology, Macau 999078, China
Sangtian Guan
Department of Engineering Science, Faculty of Innovation Engineering, Macau University of Science and Technology, Macau 999078, China
Fei-Yue Wang
Professor, Formerly The University of Arizona, Currently Chinese Academy of Sciences
Intelligent Systems, Intelligent Vehicles, Robotics and Automation, Blockchain, DAO