VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling autonomous drones to accurately interpret and execute free-form natural language instructions in GPS-denied indoor environments. It presents the first end-to-end indoor drone control system that directly leverages a large vision-language model (VLLM), integrating linguistic semantics with visual perception to achieve semantic-driven, context-aware high-level flight behaviors without task-specific engineering. The proposed framework unifies multimodal reasoning, visual navigation, obstacle avoidance, and dynamic response mechanisms within a custom-built, high-fidelity simulation environment. Experimental results demonstrate successful execution of complex, long-horizon, multi-objective navigation tasks, with significant improvements in both task success rate and system adaptability over conventional approaches.

πŸ“ Abstract
This paper introduces VLN-Pilot, a novel framework in which a large Vision-and-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM-driven agent to achieve high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.
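The abstract describes a perceive, reason, act loop in which a VLLM fuses the language instruction with the current visual observation and emits a high-level flight action. The paper does not publish its actual interface, so the sketch below is purely illustrative: the action vocabulary, `Observation` fields, and `stub_vlm` rule are hypothetical stand-ins for the real model call and camera pipeline.

```python
from dataclasses import dataclass

# Hypothetical high-level action vocabulary; illustrative only,
# not the paper's published interface.
ACTIONS = {"FORWARD", "TURN_LEFT", "TURN_RIGHT", "ASCEND", "DESCEND", "STOP"}

@dataclass
class Observation:
    scene_caption: str    # stand-in for the drone's onboard camera frame
    obstacle_ahead: bool  # stand-in for a depth/collision signal

def build_prompt(instruction: str, obs: Observation) -> str:
    """Ground the free-form instruction in the current visual context."""
    return (
        f"Instruction: {instruction}\n"
        f"Scene: {obs.scene_caption}\n"
        f"Obstacle ahead: {obs.obstacle_ahead}\n"
        f"Choose one action from {sorted(ACTIONS)}."
    )

def stub_vlm(prompt: str) -> str:
    """Placeholder for a real VLLM call (e.g. an API request).
    A trivial rule keeps the loop runnable end to end."""
    if "Obstacle ahead: True" in prompt:
        return "TURN_LEFT"
    if "target" in prompt.lower():
        return "FORWARD"
    return "STOP"

def pilot_step(instruction: str, obs: Observation, vlm=stub_vlm) -> str:
    """One perceive -> reason -> act cycle of the language-guided pilot."""
    action = vlm(build_prompt(instruction, obs))
    # Guard against free-form model output outside the action vocabulary.
    return action if action in ACTIONS else "STOP"
```

In a real system the stub would be replaced by a multimodal model call that receives the raw camera frame, and the loop would repeat until the instruction's semantic targets are reached; the guard on the returned action illustrates the kind of safety constraint an obstacle-avoidance layer would enforce.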
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
Autonomous Drone
Indoor Navigation
Natural Language Instruction
GPS-denied Environment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Autonomous Drone Navigation
Indoor UAV
Natural Language Instruction
Multimodal Reasoning
Bessie Dominguez-Dager
PhD student, University of Alicante
Computer Vision, Deep Learning, Mixed Reality
Sergio Suescun-Ferrandiz
Researcher in the RoViT group at the University of Alicante
Artificial Intelligence
FΓ©lix Escalona
University Institute for Computer Research, University of Alicante, Ctra. San Vicente del Raspeig SN, 03690, Alicante, Spain
Francisco GΓ³mez-Donoso
University Institute for Computer Research, University of Alicante, Ctra. San Vicente del Raspeig SN, 03690, Alicante, Spain
Miguel Cazorla
University Institute for Computer Research, University of Alicante, Ctra. San Vicente del Raspeig SN, 03690, Alicante, Spain