Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

📅 2026-04-15

🤖 AI Summary

This survey addresses Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN), which aims to enable drones to interpret high-level linguistic instructions and execute long-horizon navigation in complex 3D environments. It establishes a systematic methodological taxonomy spanning modular architectures, Vision-Language-Action (VLA) models, generative world models, and efficient deployment techniques, and it reviews mainstream simulation platforms, datasets, and evaluation protocols. The work highlights the emerging integration of generative world models with VLA frameworks and outlines a forward-looking roadmap covering multi-agent collaboration and air-ground coordination. It also analyzes core challenges, including the sim-to-real gap, perception in dynamic environments, and linguistic ambiguity, and synthesizes current benchmarks to guide future research.

📝 Abstract

Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward-looking research roadmap to guide future inquiry into key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.

Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation

Unmanned Aerial Vehicles

Embodied AI

3D Navigation

Language-Grounded Robotics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-and-Language Navigation

Unmanned Aerial Vehicles

Vision-Language-Action Models