"Hi AirStar, Guide Me to the Badminton Court."

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key limitations of conventional drones: reliance on remote controllers, limited environmental perception, and weak autonomous task planning. To this end, we propose AirStar, an embodied intelligence platform that pioneers the integration of large language models (LLMs) as the cognitive core of unmanned aerial vehicles. AirStar unifies vision-language navigation, cross-modal voice and gesture interaction, contextual reasoning, geospatial knowledge modeling, and intelligent cinematography with target tracking. The framework enables both long-range geospatial navigation and short-range fine-grained control via natural language and gesture commands, eliminating the need for a traditional remote controller. Extensive real-world experiments demonstrate AirStar's high efficiency and accuracy in interactive navigation, while its modular architecture ensures strong functional extensibility and seamless component integration. This work establishes a reusable technical paradigm for general-purpose, instruction-driven intelligent drone agents.

📝 Abstract
Unmanned Aerial Vehicles, operating in environments with relatively few obstacles, offer high maneuverability and full three-dimensional mobility. This allows them to rapidly approach objects and perform a wide range of tasks that are often challenging for ground robots, making them ideal for exploration, inspection, aerial imaging, and everyday assistance. In this paper, we introduce AirStar, a UAV-centric embodied platform that turns a UAV into an intelligent aerial assistant: a large language model acts as the cognitive core for environmental understanding, contextual reasoning, and task planning. AirStar accepts natural interaction through voice commands and gestures, removing the need for a remote controller and significantly broadening its user base. It combines geospatial knowledge-driven long-distance navigation with contextual reasoning for fine-grained short-range control, resulting in an efficient and accurate vision-and-language navigation (VLN) capability. Furthermore, the system offers built-in capabilities such as cross-modal question answering, intelligent filming, and target tracking. With a highly extensible framework, it supports seamless integration of new functionalities, paving the way toward a general-purpose, instruction-driven intelligent UAV agent. The supplementary PPT is available at https://buaa-colalab.github.io/airstar.github.io.
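
The paper itself ships no code, but the pipeline the abstract describes (multimodal input, an LLM cognitive core for planning, dispatch to built-in skills) can be outlined as a minimal sketch. Everything below is hypothetical: the command schema, the skill names, and `plan_with_llm`, which stands in for the real LLM with a keyword heuristic so the example runs standalone.

```python
# Hypothetical sketch of an instruction-driven UAV agent loop.
# None of these names come from AirStar; the LLM planner is stubbed
# with a keyword heuristic so the example runs without any model.
from dataclasses import dataclass


@dataclass
class Command:
    skill: str      # which capability to invoke
    argument: str   # free-form target or question


def plan_with_llm(utterance: str) -> Command:
    """Stand-in for the LLM cognitive core: map a natural-language
    utterance to one of the platform's skills."""
    text = utterance.lower()
    if "guide" in text or "go to" in text:
        return Command("navigate", utterance)
    if "follow" in text or "track" in text:
        return Command("track", utterance)
    if "film" in text or "shoot" in text:
        return Command("film", utterance)
    return Command("answer", utterance)  # cross-modal Q&A fallback


# Skill handlers are print stubs here; a real system would call
# flight-control and perception modules instead.
SKILLS = {
    "navigate": lambda arg: print(f"[VLN] planning route for: {arg}"),
    "track":    lambda arg: print(f"[tracking] locking onto: {arg}"),
    "film":     lambda arg: print(f"[cinematography] framing shot: {arg}"),
    "answer":   lambda arg: print(f"[QA] answering: {arg}"),
}


def handle(utterance: str) -> None:
    cmd = plan_with_llm(utterance)
    SKILLS[cmd.skill](cmd.argument)


if __name__ == "__main__":
    handle("Hi AirStar, guide me to the badminton court.")
```

In the actual system, the heuristic would be replaced by a prompted LLM call, and voice or gesture recognition would produce the utterance upstream of this loop.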
Problem

Research questions and friction points this paper is trying to address.

How can a UAV be turned into an intelligent assistant with a language model as its cognitive core?
How can natural voice and gesture interaction replace remote-controller-based UAV control?
How can geospatial navigation be combined with contextual reasoning for accurate VLN?
Innovation

Methods, ideas, or system contributions that make the work stand out.

UAV-centric embodied platform with a large language model as its cognitive core
Natural voice and gesture interaction for controller-free operation
Geospatial and contextual reasoning for precise long- and short-range navigation (see the sketch after this list)
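
One way to read the long-range/short-range split is as a mode selector: if the instruction names a landmark found in the geospatial knowledge base, plan long-range waypoints toward its known position; otherwise fall back to short-range, perception-driven fine control. A minimal sketch follows, with an invented landmark table and made-up coordinates; none of this reflects AirStar's actual data structures.

```python
# Hypothetical illustration of the long-range / short-range split.
# The landmark table, coordinates, and controller stubs are invented.
GEO_LANDMARKS = {
    "badminton court": (40.3558, 116.3012),  # made-up coordinates
    "library":         (40.3571, 116.3044),
}


def navigate(instruction: str) -> None:
    target = instruction.lower()
    for name, (lat, lon) in GEO_LANDMARKS.items():
        if name in target:
            # Long range: the landmark is in the geospatial knowledge
            # base, so fly waypoints toward its known coordinates.
            print(f"long-range: waypoints to {name} at ({lat}, {lon})")
            return
    # Short range: no known landmark, so rely on onboard perception
    # and contextual reasoning for fine-grained control.
    print(f"short-range: visual search and fine control for '{instruction}'")


navigate("Guide me to the badminton court")
navigate("Land next to the red backpack")
```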