AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models struggle to address the challenges posed by aerial manipulation scenarios, including floating-base dynamics, strong coupling between drones and robotic arms, and long-horizon tasks. To bridge this gap, this work presents the first VLA benchmark platform tailored for aerial manipulation, featuring a physics-based simulation environment and 3,000 multimodal teleoperated demonstration trajectories spanning basic control, object understanding, semantic reasoning, and long-horizon planning tasks. The study introduces a standardized evaluation framework, a multidimensional metric suite, and a high-quality multimodal dataset specifically designed for aerial manipulator systems. Through systematic evaluation of state-of-the-art VLA and vision-language models, the work validates the feasibility of transferring the VLA paradigm to aerial robotics and reveals its performance boundaries in mobility, control, and high-level planning, thereby establishing foundational data and benchmark resources for general-purpose aerial robot research.

📝 Abstract

While Vision-Language-Action (VLA) models have achieved remarkable success in ground-based embodied intelligence, their application to Aerial Manipulation Systems (AMS) remains a largely unexplored frontier. The inherent characteristics of AMS, including floating-base dynamics, strong coupling between the UAV and the manipulator, and the multi-step, long-horizon nature of operational tasks, pose severe challenges to existing VLA paradigms designed for static or 2D mobile bases. To bridge this gap, we propose AIR-VLA, the first VLA benchmark specifically tailored for aerial manipulation. We construct a physics-based simulation environment and release a high-quality multimodal dataset comprising 3000 manually teleoperated demonstrations, covering base manipulation, object & spatial understanding, semantic reasoning, and long-horizon planning. Leveraging this platform, we systematically evaluate mainstream VLA models and state-of-the-art vision-language models (VLMs). Our experiments not only validate the feasibility of transferring VLA paradigms to aerial systems but also, through multi-dimensional metrics tailored to aerial tasks, reveal the capabilities and boundaries of current models regarding UAV mobility, manipulator control, and high-level planning. AIR-VLA establishes a standardized testbed and data foundation for future research in general-purpose aerial robotics. The AIR-VLA resources will be available at https://github.com/SpencerSon2001/AIR-VLA.

Problem

Research questions and friction points this paper is trying to address.

Aerial Manipulation

Vision-Language-Action

Floating-base Dynamics

UAV-Manipulator Coupling

Long-horizon Tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

Aerial Manipulation

Physics-based Simulation