AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models

📅 2025-11-03

🤖 AI Summary
Vision-language models (VLMs) exhibit instruction inconsistency, hallucination, and dynamically infeasible outputs when deployed directly to control aerial manipulators, compromising safety and reliability. Method: We propose the first VLM-based framework for aerial manipulation, decoupling high-level reasoning from low-level control. The approach integrates structured prompt engineering, natural-language chain-of-thought reasoning, and a discrete library of flight-safe skills, requiring no model fine-tuning, to generate interpretable, temporally consistent, and hallucination-resistant motion plans. Contribution/Results: To our knowledge, this is the first work to adapt pre-trained VLMs to aerial manipulation. Evaluated in both simulation and on physical hardware, the framework generalizes well across unseen instructions, objects, and environments in multi-step pick-and-place tasks, significantly improving task success rate and operational safety.

📝 Abstract
The rapid progress of vision-language models (VLMs) has sparked growing interest in robotic control, where natural language can express operation goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable, since the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight. In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs for aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.
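The pipeline the abstract describes (structured prompt → reasoning trace → skill selection from a closed library) can be sketched as follows. This is a minimal illustration, not the paper's actual interface: the skill names, prompt fields, and output format are all assumptions, and the VLM call itself is left as an external stub.

```python
# Minimal sketch of the decoupled pipeline: a structured prompt elicits a
# chain-of-thought reasoning trace, and the final action is constrained to a
# predefined library of discrete, flight-safe skills. Skill names, prompt
# fields, and the output format are illustrative assumptions.

SKILL_LIBRARY = {"takeoff", "hover", "approach", "grasp", "lift", "place", "retreat", "land"}

def build_prompt(instruction: str, scene_context: str) -> str:
    """Encode the instruction, task context, and safety constraints into one structured prompt."""
    return (
        "ROLE: high-level planner for an aerial manipulator.\n"
        f"INSTRUCTION: {instruction}\n"
        f"SCENE: {scene_context}\n"
        f"ALLOWED SKILLS: {sorted(SKILL_LIBRARY)}\n"
        "CONSTRAINTS: choose exactly one allowed skill per step; never invent skills.\n"
        "FORMAT: lines of 'REASON: ...' followed by one final 'SKILL: <name>'."
    )

def select_skill(vlm_output: str) -> str:
    """Parse the reasoning trace and return a skill, rejecting hallucinated names."""
    skill = None
    for line in vlm_output.splitlines():
        if line.startswith("SKILL:"):
            skill = line.split(":", 1)[1].strip()
    if skill not in SKILL_LIBRARY:
        # Out-of-library (hallucinated) or missing skill: fall back to a safe hold.
        return "hover"
    return skill
```

The key property is that the VLM's free-form reasoning never reaches the controller directly; only a validated member of the discrete skill set does.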
Problem

Research questions and friction points this paper is trying to address.

Ensuring safe VLM-driven control for aerial manipulators
Mitigating inconsistent and hallucinated robot commands
Separating reasoning from control for robust task execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured prompting separates reasoning from control
Discrete skill library ensures flight-safe execution
Symbolic reasoning prevents hallucinated unsafe commands
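The bullets above can be made concrete with a small temporal-consistency gate: even a skill that exists in the library is only accepted if it is a valid successor of the previously executed skill. The transition table below is an illustrative assumption, not the paper's actual skill automaton.

```python
# Sketch of a flight-safety gate enforcing temporally consistent skill
# sequences (e.g., no grasp before approach). The transition table is an
# illustrative assumption, not the paper's actual skill automaton.

ALLOWED_NEXT = {
    "takeoff": {"hover"},
    "hover": {"approach", "land"},
    "approach": {"grasp", "hover"},
    "grasp": {"lift"},
    "lift": {"place", "hover"},
    "place": {"retreat"},
    "retreat": {"hover"},
}

def gate(prev_skill: str, proposed: str) -> str:
    """Accept the proposed skill only if it is a valid successor; otherwise hold a hover."""
    if proposed in ALLOWED_NEXT.get(prev_skill, set()):
        return proposed
    return "hover"  # reject temporally inconsistent commands
```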
👥 Authors

Sarthak Mishra
Robotics Research Center, IIIT Hyderabad, India

Rishabh Dev Yadav
PhD Candidate, University of Manchester
Robotics, Control Systems

Avirup Das
Department of Computer Science, University of Manchester, UK

Saksham Gupta
Neurosurgery Resident

Wei Pan
Department of Computer Science, University of Manchester, UK

Spandan Roy
Assistant Professor, Robotics Research Center, IIIT Hyderabad
Adaptive-robust control, Switched systems, Artificial delay control, Robotics