🤖 AI Summary
Problem: Vision-language models (VLMs) exhibit instruction inconsistency, hallucination, and dynamically infeasible outputs when directly deployed to control aerial manipulators, compromising safety and reliability. Method: We propose a VLM-based framework for aerial manipulation that decouples high-level reasoning from low-level control. Our approach integrates structured prompt engineering, natural-language chain-of-thought reasoning, and a discrete library of flight-safe skills, requiring no model fine-tuning, to generate interpretable, temporally consistent, and hallucination-resistant motion plans. Contribution/Results: To our knowledge, this is the first work to adapt pre-trained VLMs to aerial manipulation. Evaluated in simulation and on physical hardware, the framework generalizes to unseen instructions, objects, and environments in multi-step pick-and-place tasks, significantly improving task success rate and operational safety.
📝 Abstract
The rapid progress of vision-language models (VLMs) has sparked growing interest in robotic control, where natural language can express operational goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable, since the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight. In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs to aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural-language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.
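The pipeline the abstract describes (structured prompt → natural-language reasoning trace → selection from a discrete skill library) can be sketched as below. This is a minimal illustration of the general pattern, not the paper's implementation: the skill names, prompt fields, and helper functions (`SKILL_LIBRARY`, `build_prompt`, `validate_plan`) are all hypothetical.

```python
# Hypothetical sketch of a decoupled reasoning-to-skill pipeline.
# All names and skills here are illustrative assumptions, not AERMANI-VLM's API.

# Discrete, flight-safe skill library: the VLM may only select from these,
# so symbolic reasoning never maps directly onto raw actuator commands.
SKILL_LIBRARY = {
    "takeoff": "execute takeoff",
    "fly_to": "fly to a named waypoint",
    "grasp": "close gripper on target object",
    "release": "open gripper",
    "land": "execute landing",
}

def build_prompt(instruction: str, context: str, constraints: str) -> str:
    """Encode the instruction, task context, and safety constraints into a
    structured prompt that asks for step-by-step reasoning plus skill choices."""
    return (
        f"Instruction: {instruction}\n"
        f"Context: {context}\n"
        f"Safety constraints: {constraints}\n"
        f"Allowed skills: {sorted(SKILL_LIBRARY)}\n"
        "Think step by step, then output one allowed skill name per step."
    )

def validate_plan(skill_names: list[str]) -> list[str]:
    """Filter out hallucinated skills: keep only steps that exist in the
    library, so only flight-safe primitives ever reach the controller."""
    return [s for s in skill_names if s in SKILL_LIBRARY]

# Example: a mock VLM response containing one hallucinated skill ("teleport"),
# which the validation step drops before execution.
raw_plan = ["takeoff", "fly_to", "grasp", "teleport", "fly_to", "release", "land"]
safe_plan = validate_plan(raw_plan)
```

In a real system the plan would come from querying the VLM with `build_prompt(...)` and parsing its reasoning trace; the key design choice shown here is that validation against a closed skill set turns open-ended text generation into a constrained selection problem.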