🤖 AI Summary
Existing UAV systems exhibit low task-execution efficiency in dynamic environments and rely heavily on human monitoring and decision-making. Method: This paper proposes an end-to-end embodied UAV agent system tailored for real-world applications such as logistics delivery and disaster response. It introduces a synergistic architecture integrating task-driven agent scheduling, multimodal (vision-temporal-semantic) perception fusion, and scene-adaptive keyframe extraction, thereby establishing an embodied decision-making and reasoning framework that enables zero-shot cross-scenario generalization. Contribution/Results: Experimental evaluation demonstrates that the system achieves high-accuracy semantic understanding and autonomous response across diverse dynamic real-world scenarios, significantly improving environmental adaptability and task-execution efficiency and overcoming fundamental limitations of the conventional human-UAV collaborative paradigm.
📄 Abstract
Unmanned Aerial Vehicles (UAVs) are increasingly important in dynamic environments such as logistics transportation and disaster response. However, current UAV operations often rely on human operators to monitor aerial videos and make operational decisions. This mode of human-machine collaboration suffers from significant limitations in efficiency and adaptability. In this paper, we present AirVista-II -- an end-to-end agentic system for embodied UAVs, designed to enable general-purpose semantic understanding and reasoning in dynamic scenes. The system integrates agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored to different temporal scenarios, enabling efficient capture of critical scene information. Experimental results demonstrate that the proposed system achieves high-quality semantic understanding across diverse UAV-based dynamic scenarios under a zero-shot setting.
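The abstract does not detail how the differentiated keyframe extraction works. As an illustration only, a minimal sketch of one common scene-adaptive strategy -- sampling a frame whenever inter-frame change exceeds a threshold that adapts to how dynamic the scene has been so far -- might look like the following (the function name, the thresholding rule, and all parameters are assumptions, not taken from the paper):

```python
import numpy as np

def extract_keyframes(frames, base_threshold=0.1):
    """Illustrative scene-adaptive keyframe selection (not the paper's method).

    `frames` is a list of grayscale frames (2-D numpy arrays with values
    in [0, 1]). A frame is kept as a keyframe when its mean absolute
    difference from the last keyframe exceeds a threshold that grows with
    the running average of observed change, so highly dynamic scenes are
    sampled more selectively than static ones.
    """
    if not frames:
        return []
    keyframes = [0]  # always keep the first frame
    diffs = []
    for i in range(1, len(frames)):
        # change relative to the most recently selected keyframe
        diff = float(np.mean(np.abs(frames[i] - frames[keyframes[-1]])))
        diffs.append(diff)
        # threshold adapts to the average change seen so far
        threshold = base_threshold + 0.5 * float(np.mean(diffs))
        if diff > threshold:
            keyframes.append(i)
    return keyframes

# Example: a static scene that abruptly changes at frame 5
frames = [np.zeros((4, 4))] * 5 + [np.ones((4, 4))] * 5
print(extract_keyframes(frames))  # -> [0, 5]
```

In this toy run, only the first frame and the frame where the scene changes are selected; a real system would operate on decoded video frames and likely combine motion cues with semantic relevance to the current task.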