DroneVLA: VLA based Aerial Manipulation

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes the first vision-language-action (VLA) system tailored for aerial manipulation, enabling non-expert users to command drones to perform grasping and delivery tasks through natural language instructions. The system integrates natural language understanding, semantic reasoning, and visual perception by leveraging Grounding DINO for object detection, MediaPipe for human pose estimation, dynamic A* for path planning, and RGB-D visual servoing for precise control, thereby achieving end-to-end task interpretation and execution. Experimental results in real-world environments demonstrate high accuracy and feasibility, with maximum, mean, and root-mean-square errors of 0.164 m, 0.070 m, and 0.084 m, respectively, in localization and navigation. This study represents the first successful application of a VLA model to aerial manipulation scenarios.
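The reported localization figures are standard trajectory-error summaries. A minimal sketch of how such maximum, mean Euclidean, and root-mean-square errors can be computed (the function name and example waypoints here are illustrative, not from the paper):

```python
import numpy as np

def localization_errors(estimated, ground_truth):
    """Per-waypoint Euclidean errors with max / mean / RMSE summaries."""
    diffs = np.asarray(estimated) - np.asarray(ground_truth)
    dists = np.linalg.norm(diffs, axis=1)  # Euclidean error per waypoint (m)
    return dists.max(), dists.mean(), float(np.sqrt(np.mean(dists ** 2)))

# Toy example: two 3-D waypoints with 0.1 m and 0.2 m errors
mx, mean, rmse = localization_errors([[0, 0, 0], [1, 0, 0]],
                                     [[0, 0, 0.1], [1, 0.2, 0]])
```

The three scalars correspond directly to the max, mean, and RMS errors the paper reports (0.164 m, 0.070 m, and 0.084 m).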

📝 Abstract
As aerial platforms evolve from passive observers to active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system integrates Grounding DINO, MediaPipe, and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. The VLA model performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping the relevant objects in the scene. Grounding DINO and a dynamic A* planning algorithm are used to navigate and safely relocate the object. To ensure safe and natural interaction during the handover phase, the system employs a human-centric controller driven by MediaPipe. This module provides real-time human pose estimation, allowing the drone to use visual servoing to maintain a stable position directly in front of the user, facilitating a comfortable handover. We demonstrate the system's efficacy through real-world localization and navigation experiments, which yielded maximum, mean Euclidean, and root-mean-square errors of 0.164 m, 0.070 m, and 0.084 m, respectively, highlighting the feasibility of VLA for aerial manipulation.
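The handover behavior described above, holding a standoff position in front of the user from a pose estimate, can be sketched as a saturated proportional visual-servoing law. This is an assumption-laden illustration: the gain, standoff offset, frame convention, and function name are all hypothetical, not taken from the paper.

```python
import numpy as np

def handover_velocity_cmd(user_pos_cam,
                          target_offset=np.array([0.0, 0.0, 1.2]),
                          kp=0.8, v_max=0.5):
    """Proportional servoing: drive the user's position in the camera frame
    toward a fixed standoff offset, with speed saturation for safety."""
    error = np.asarray(user_pos_cam, dtype=float) - target_offset  # m
    cmd = kp * error
    speed = np.linalg.norm(cmd)
    if speed > v_max:
        cmd = cmd * (v_max / speed)  # cap commanded speed near a person
    return cmd

# User detected 2.2 m ahead; desired standoff is 1.2 m, so the drone advances
cmd = handover_velocity_cmd([0.0, 0.0, 2.2])
```

Saturating the command keeps approach speeds bounded during close human-drone interaction, which matches the paper's emphasis on a safe, comfortable handover.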
Problem

Research questions and friction points this paper is trying to address.

aerial manipulation
natural language commands
human-drone interaction
object handover
non-expert users
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
natural language command
aerial manipulation
human-centric handover
Grounding DINO
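The dynamic A* planning listed among the contributions can be illustrated with a standard grid-based A* search; replanning on an updated occupancy grid gives the "dynamic" behavior. The grid representation, connectivity, and costs below are assumptions for illustration, not details from the paper.

```python
import heapq

def astar(grid, start, goal):
    """Grid A* with 4-connectivity; cells with value 1 are obstacles."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    frontier = [(h(start), 0, start, [start])]  # (f, g, node, path)
    visited = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0 \
                    and (r, c) not in visited:
                heapq.heappush(frontier, (g + 1 + h((r, c)), g + 1,
                                          (r, c), path + [(r, c)]))
    return None  # no collision-free path

# Toy map: a wall forces a detour around row 1
path = astar([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 2))
```

In a dynamic variant, the planner would be re-invoked whenever Grounding DINO detections change the occupancy grid mid-flight.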