🤖 AI Summary
Existing UAV systems exhibit low task-execution efficiency in dynamic environments and rely heavily on human monitoring and decision-making. Method: This paper proposes an end-to-end embodied UAV agent system tailored for real-world applications such as logistics delivery and disaster response. It introduces a synergistic architecture integrating task-driven agent scheduling, multimodal (vision-temporal-semantic) perception fusion, and scene-adaptive keyframe extraction, thereby establishing an embodied decision-making and reasoning framework that enables zero-shot cross-scenario generalization. Contribution/Results: Experimental evaluation demonstrates that the system achieves high-accuracy semantic understanding and autonomous response across diverse dynamic real-world scenarios, significantly improving environmental adaptability and task-execution efficiency and overcoming fundamental limitations of the conventional human-UAV collaborative paradigm.
📄 Abstract
Unmanned Aerial Vehicles (UAVs) are increasingly important in dynamic environments such as logistics transportation and disaster response. However, current UAV operations often rely on human operators to monitor aerial videos and make operational decisions. This mode of human-machine collaboration suffers from significant limitations in efficiency and adaptability. In this paper, we present AirVista-II -- an end-to-end agentic system for embodied UAVs, designed to enable general-purpose semantic understanding and reasoning in dynamic scenes. The system integrates agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored to different temporal scenarios, enabling efficient capture of critical scene information. Experimental results demonstrate that the proposed system achieves high-quality semantic understanding across diverse UAV-based dynamic scenarios under a zero-shot setting.
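The abstract does not detail how the differentiated keyframe extraction works. As an illustration only, a minimal sketch of one common scene-adaptive strategy -- sampling a frame whenever inter-frame change exceeds a threshold that adapts to how dynamic the scene has been so far -- might look like the following (the function name, the thresholding rule, and all parameters are assumptions, not taken from the paper):

```python
import numpy as np

def extract_keyframes(frames, base_threshold=0.1):
    """Illustrative scene-adaptive keyframe selection (not the paper's method).

    `frames` is a list of grayscale frames (2-D numpy arrays with values
    in [0, 1]). A frame is kept as a keyframe when its mean absolute
    difference from the last keyframe exceeds a threshold that grows with
    the running average of observed change, so highly dynamic scenes are
    sampled more selectively than static ones.
    """
    if not frames:
        return []
    keyframes = [0]  # always keep the first frame
    diffs = []
    for i in range(1, len(frames)):
        # change relative to the most recently selected keyframe
        diff = float(np.mean(np.abs(frames[i] - frames[keyframes[-1]])))
        diffs.append(diff)
        # threshold adapts to the average change seen so far
        threshold = base_threshold + 0.5 * float(np.mean(diffs))
        if diff > threshold:
            keyframes.append(i)
    return keyframes

# Example: a static scene that abruptly changes at frame 5
frames = [np.zeros((4, 4))] * 5 + [np.ones((4, 4))] * 5
print(extract_keyframes(frames))  # -> [0, 5]
```

In this toy run, only the first frame and the frame where the scene changes are selected; a real system would operate on decoded video frames and likely combine motion cues with semantic relevance to the current task.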