AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation

📅 2026-01-29
🤖 AI Summary
Existing vision-language-action (VLA) models struggle to address the challenges posed by aerial manipulation scenarios, including floating-base dynamics, strong coupling between drones and robotic arms, and long-horizon tasks. To bridge this gap, this work presents the first VLA benchmark platform tailored for aerial manipulation, featuring a physics-based simulation environment and 3,000 multimodal teleoperated demonstration trajectories spanning basic control, object understanding, semantic reasoning, and long-horizon planning tasks. The study introduces a standardized evaluation framework, a multidimensional metric suite, and a high-quality multimodal dataset specifically designed for aerial manipulator systems. Through systematic evaluation of state-of-the-art VLA and vision-language models, the work validates the feasibility of transferring the VLA paradigm to aerial robotics and reveals its performance boundaries in mobility, control, and high-level planning, thereby establishing foundational data and benchmark resources for general-purpose aerial robot research.

📝 Abstract
While Vision-Language-Action (VLA) models have achieved remarkable success in ground-based embodied intelligence, their application to Aerial Manipulation Systems (AMS) remains a largely unexplored frontier. The inherent characteristics of AMS, including floating-base dynamics, strong coupling between the UAV and the manipulator, and the multi-step, long-horizon nature of operational tasks, pose severe challenges to existing VLA paradigms designed for static or 2D mobile bases. To bridge this gap, we propose AIR-VLA, the first VLA benchmark specifically tailored for aerial manipulation. We construct a physics-based simulation environment and release a high-quality multimodal dataset comprising 3000 manually teleoperated demonstrations, covering base manipulation, object & spatial understanding, semantic reasoning, and long-horizon planning. Leveraging this platform, we systematically evaluate mainstream VLA models and state-of-the-art VLMs. Our experiments not only validate the feasibility of transferring VLA paradigms to aerial systems but also, through multi-dimensional metrics tailored to aerial tasks, reveal the capabilities and boundaries of current models regarding UAV mobility, manipulator control, and high-level planning. AIR-VLA establishes a standardized testbed and data foundation for future research in general-purpose aerial robotics. The AIR-VLA resources will be available at https://github.com/SpencerSon2001/AIR-VLA.
Problem

Research questions and friction points this paper is trying to address.

Aerial Manipulation
Vision-Language-Action
Floating-base Dynamics
UAV-Manipulator Coupling
Long-horizon Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
Aerial Manipulation
Physics-based Simulation
Multimodal Dataset
Embodied Intelligence
Jianli Sun
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Bin Tian
Institute of Automation, Chinese Academy of Sciences
Parallel Intelligence, Parallel Driving, Intelligent Mining
Qiyao Zhang
School of Automation, Beijing Institute of Technology, Beijing 100081, China
Chengxiang Li
School of Information and Intelligent Engineering, University of Sanya, Sanya 572000, Hainan Province, China
Zihan Song
School of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, Hunan Province, China
Zhiyong Cui
Professor, Beihang University
Foundation Models, Autonomous Driving, Urban Computing, Traffic Prediction, Traffic Control
Yisheng Lv
The University of Chinese Academy of Sciences, and Chinese Academy of Sciences
Parallel Intelligence, AI for Transportation, Autonomous Vehicles, Parallel Transportation Systems
Yonglin Tian
Institute of Automation, Chinese Academy of Sciences
Parallel intelligence, Parallel unmanned systems, Intelligent vehicles, Autonomous driving