🤖 AI Summary
This study addresses the task of Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN), aiming to enable drones to interpret high-level linguistic instructions and execute long-horizon navigation in complex 3D environments. It establishes the first systematic methodological taxonomy, encompassing modular architectures, Vision-Language-Action (VLA) models, generative world models, and efficient deployment techniques, while integrating mainstream simulation platforms and evaluation protocols. The work proposes a novel paradigm that synergistically combines generative world models with VLA frameworks and outlines a forward-looking roadmap for multi-agent collaboration and aerial-ground coordination. Furthermore, it provides an in-depth analysis of core challenges, including the sim-to-real gap, dynamic environment perception, and linguistic ambiguity, and offers the community standardized benchmarks and comprehensive guidance for future research.
📝 Abstract
Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward-looking research roadmap to guide future inquiry into key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.