🤖 AI Summary
Vision-language navigation (VLN) for unmanned aerial vehicles (UAVs) suffers from poor generalization and limitations imposed by discrete action spaces. Method: This paper introduces the first end-to-end monocular vision-language continuous navigation framework, integrating large language model (LLM)-based instruction rewriting, vision-language model (VLM)-driven cross-modal target retrieval, and continuous velocity trajectory planning, eliminating reliance on localization/ranging sensors and explicit maps. Contribution/Results: It breaks from conventional discrete-action paradigms by mapping natural language instructions directly to continuous velocity control, supporting open-vocabulary understanding and zero-shot cross-environment transfer. Experiments show that it outperforms all baselines in zero-shot transfer across multiple simulated environments, significantly improves navigation success rates for both direct and indirect instructions in real indoor and outdoor settings, and achieves a 32.7% gain in open-vocabulary target matching accuracy.
📝 Abstract
Vision-and-language navigation (VLN) is a long-standing challenge in autonomous robotics, aiming to empower agents to follow human instructions while navigating complex environments. Two key bottlenecks remain in this field: generalization to out-of-distribution environments and reliance on fixed discrete action spaces. To address these challenges, we propose Vision-Language Fly (VLFly), a framework tailored for Unmanned Aerial Vehicles (UAVs) to execute language-guided flight. Without requiring localization or active ranging sensors, VLFly outputs continuous velocity commands purely from egocentric observations captured by an onboard monocular camera. VLFly integrates three modules: an instruction encoder based on a large language model (LLM) that reformulates high-level language into structured prompts, a goal retriever powered by a vision-language model (VLM) that matches these prompts to goal images via vision-language similarity, and a waypoint planner that generates executable trajectories for real-time UAV control. VLFly is evaluated across diverse simulation environments without additional fine-tuning and consistently outperforms all baselines. Moreover, real-world VLN tasks in indoor and outdoor environments under direct and indirect instructions demonstrate that VLFly achieves robust open-vocabulary goal understanding and generalized navigation, even given abstract language input.
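The three-module pipeline described above can be sketched in miniature. This is a hedged illustration only: the function names, the template-based instruction rewriting, the dummy embeddings standing in for VLM (e.g. CLIP-style) features, and the proportional velocity law are all placeholder assumptions, not the authors' implementation.

```python
import numpy as np


def rewrite_instruction(instruction: str) -> str:
    """Stage 1 (LLM instruction encoder): reformulate a high-level
    instruction into a structured goal prompt. A real system would
    query an LLM; a fixed template is used here as a stand-in."""
    return f"a photo of {instruction.strip().lower()}"


def retrieve_goal(prompt_emb: np.ndarray, frame_embs: np.ndarray) -> int:
    """Stage 2 (VLM goal retriever): select the candidate goal image
    whose embedding has the highest cosine similarity to the prompt
    embedding."""
    sims = frame_embs @ prompt_emb / (
        np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(prompt_emb)
    )
    return int(np.argmax(sims))


def plan_velocity(current_xy, goal_xy, gain=0.5, v_max=1.0):
    """Stage 3 (waypoint planner): map the displacement toward the
    selected goal into a continuous velocity command, saturated at
    v_max (a simple proportional law, purely illustrative)."""
    v = gain * (np.asarray(goal_xy, dtype=float) - np.asarray(current_xy, dtype=float))
    speed = np.linalg.norm(v)
    if speed > v_max:
        v = v * (v_max / speed)
    return v


if __name__ == "__main__":
    prompt = rewrite_instruction("Fly to the red chair")
    # Dummy 3-d embeddings standing in for vision-language features.
    prompt_emb = np.array([1.0, 0.0, 0.2])
    frame_embs = np.array([[0.1, 0.9, 0.0],
                           [0.9, 0.1, 0.3],   # closest to the prompt
                           [0.2, 0.2, 0.9]])
    goal_idx = retrieve_goal(prompt_emb, frame_embs)
    velocity = plan_velocity([0.0, 0.0], [4.0, 3.0])
    print(prompt, goal_idx, velocity)
```

The sketch conveys the key design choice of the abstract: language is mapped to a continuous velocity command end-to-end, with no map, localization, or discrete action set in the loop.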