🤖 AI Summary
Problem: Vision-language models (VLMs) exhibit instruction inconsistency, hallucination, and dynamically infeasible outputs when directly deployed to control aerial manipulators, compromising safety and reliability. Method: We propose a VLM-based framework for aerial manipulation that decouples high-level reasoning from low-level control. Our approach integrates structured prompt engineering, natural-language chain-of-thought reasoning, and a discrete library of flight-safe skills, requiring no model fine-tuning, to generate interpretable, temporally consistent, and hallucination-resistant motion plans. Contribution/Results: To our knowledge, this is the first work to adapt pre-trained VLMs to aerial manipulation. Evaluated in simulation and on physical hardware, the framework generalizes to unseen instructions, objects, and environments in multi-step pick-and-place tasks, significantly improving task success rate and operational safety.
📝 Abstract
The rapid progress of vision-language models (VLMs) has sparked growing interest in robotic control, where natural language can express operational goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable, since the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight. In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs to aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural-language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.
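The pipeline the abstract describes (structured prompt → natural-language reasoning trace → selection from a discrete skill library) can be sketched as below. This is a minimal illustration of the general pattern, not the paper's implementation: the skill names, prompt fields, and helper functions (`SKILL_LIBRARY`, `build_prompt`, `validate_plan`) are all hypothetical.

```python
# Hypothetical sketch of a decoupled reasoning-to-skill pipeline.
# All names and skills here are illustrative assumptions, not AERMANI-VLM's API.

# Discrete, flight-safe skill library: the VLM may only select from these,
# so symbolic reasoning never maps directly onto raw actuator commands.
SKILL_LIBRARY = {
    "takeoff": "execute takeoff",
    "fly_to": "fly to a named waypoint",
    "grasp": "close gripper on target object",
    "release": "open gripper",
    "land": "execute landing",
}

def build_prompt(instruction: str, context: str, constraints: str) -> str:
    """Encode the instruction, task context, and safety constraints into a
    structured prompt that asks for step-by-step reasoning plus skill choices."""
    return (
        f"Instruction: {instruction}\n"
        f"Context: {context}\n"
        f"Safety constraints: {constraints}\n"
        f"Allowed skills: {sorted(SKILL_LIBRARY)}\n"
        "Think step by step, then output one allowed skill name per step."
    )

def validate_plan(skill_names: list[str]) -> list[str]:
    """Filter out hallucinated skills: keep only steps that exist in the
    library, so only flight-safe primitives ever reach the controller."""
    return [s for s in skill_names if s in SKILL_LIBRARY]

# Example: a mock VLM response containing one hallucinated skill ("teleport"),
# which the validation step drops before execution.
raw_plan = ["takeoff", "fly_to", "grasp", "teleport", "fly_to", "release", "land"]
safe_plan = validate_plan(raw_plan)
```

In a real system the plan would come from querying the VLM with `build_prompt(...)` and parsing its reasoning trace; the key design choice shown here is that validation against a closed skill set turns open-ended text generation into a constrained selection problem.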