🤖 AI Summary
This work addresses the challenge of transferring vision–language–action (VLA) models, pretrained on fixed-base robotic platforms, to highly dynamic and underactuated aerial manipulation systems, where the dynamics mismatch severely degrades performance. To bridge this gap, the authors propose a payload-aware guidance mechanism that injects physical constraints during inference, alongside a synthetic navigation dataset generated via Gaussian splatting to alleviate real-world data scarcity. Notably, the guidance mechanism operates purely at inference time, requiring no retraining of the base VLA model. Extensive real-world evaluation across 460 trials shows that the synthetic data boosts navigation success from 81% to 100%, while payload-aware guidance raises grasping success from 23% to 50%. The integrated system achieves a 62% success rate on long-horizon compositional tasks.
📝 Abstract
Vision-Language-Action (VLA) models such as $\pi_0$ have demonstrated remarkable generalization across diverse fixed-base manipulators. However, transferring these foundation models to aerial platforms remains an open challenge due to the fundamental mismatch between the quasi-static dynamics of fixed-base arms and the underactuated, highly dynamic nature of flight. In this work, we introduce AirVLA, a system that investigates the transferability of manipulation-pretrained VLAs to aerial pick-and-place tasks. We find that while visual representations transfer effectively, the control dynamics required for flight do not. To bridge this "dynamics gap" without retraining the foundation model, we introduce a Payload-Aware Guidance mechanism that injects payload constraints directly into the policy's flow-matching sampling process. To overcome data scarcity, we further use a Gaussian Splatting pipeline to synthesize navigation training data. We evaluate our method across a cumulative 460 real-world trials, which demonstrate that the synthetic data is a key enabler of performance: it unlocks 100% success in navigation tasks, compared with 81% when fine-tuning on teleoperation data alone. Our inference-time intervention, Payload-Aware Guidance, raises real-world pick-and-place success from 23% to 50%. Finally, we evaluate the model on a long-horizon compositional task, achieving a 62% overall success rate. These results suggest that pretrained manipulation VLAs, with appropriate data augmentation and physics-informed guidance, can transfer to aerial manipulation and navigation, as well as compositions of these tasks.
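To make the guidance idea concrete, the sketch below shows one generic way an inference-time correction can be injected into a flow-matching sampler: at each Euler integration step, the pretrained velocity field is augmented with the gradient of a payload-constraint penalty. All names here (`payload_guidance_grad`, `guided_flow_sample`, the acceleration-limit penalty, `guidance_scale`) are illustrative assumptions, not the paper's actual interface or constraint formulation.

```python
import numpy as np

def payload_guidance_grad(action, accel_limit=2.0):
    """Hypothetical payload constraint: a quadratic hinge penalty on action
    components exceeding an acceleration limit the laden vehicle can track.
    Returns the negative gradient, pushing violating components back inward."""
    excess = np.clip(np.abs(action) - accel_limit, 0.0, None)
    return -np.sign(action) * excess

def guided_flow_sample(velocity_field, action_dim, steps=10,
                       guidance_scale=0.5, rng=None):
    """Euler integration of a flow-matching ODE from noise (t=0) to an
    action sample (t=1), with an additive guidance term injected at each
    step. A sketch of the general technique, not AirVLA's exact method."""
    rng = np.random.default_rng(rng)
    a = rng.standard_normal(action_dim)  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_field(a, t)               # pretrained policy's velocity
        g = payload_guidance_grad(a)           # physics-informed correction
        a = a + dt * (v + guidance_scale * g)  # guided Euler step
    return a
```

Because the correction enters only through the sampling loop, the frozen policy's weights are untouched, which matches the abstract's framing of guidance as an inference-time intervention.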