Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This survey addresses Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN), which aims to enable drones to interpret high-level language instructions and execute long-horizon navigation in complex 3D environments. It establishes a methodological taxonomy that spans early modular and deep learning approaches, Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures, and it reviews the simulators, datasets, and evaluation metrics that support standardized research. It also analyzes the core challenges impeding real-world deployment, including the sim-to-real gap, perception in dynamic outdoor settings, linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware, and it outlines a research roadmap covering multi-agent swarm coordination and air-ground collaboration.

📝 Abstract
Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitate standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning under linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware. By synthesizing current benchmarks and limitations, the survey concludes with a forward-looking research roadmap to guide future inquiry into key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
Unmanned Aerial Vehicles
Embodied AI
3D Navigation
Language-Grounded Robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-and-Language Navigation
Unmanned Aerial Vehicles
Vision-Language-Action Models
Generative World Models
Embodied AI
Authors
Hanxuan Chen
Autel Robotics, Shenzhen, China
Jie Zheng
Associate Professor, School of Information Science and Technology, ShanghaiTech University
Bioinformatics, artificial intelligence, biomedical data science, AI for Science, drug discovery
Siqi Yang
University of Electronic Science and Technology of China
Generative Speech Enhancement, Automatic Speech Recognition, Diffusion Models
Tianle Zeng
Southern University of Science and Technology, Shenzhen, China
Siwei Feng
Autel Robotics, Shenzhen, China
Songsheng Cheng
Autel Robotics, Shenzhen, China
Ruilong Ren
Autel Robotics, Shenzhen, China
Hanzhong Guo
University of Hong Kong
Diffusion Models, Model Efficiency
Shuai Yuan
University of Electronic Science and Technology of China
System security, Data security
Xiangyue Wang
Autel Robotics, Shenzhen, China
Kangli Wang
Autel Robotics, Shenzhen, China
Ji Pei
Autel Robotics, Shenzhen, China