🤖 AI Summary
This work addresses the limited explicit reasoning capability of unmanned aerial vehicles (UAVs) performing vision-and-language navigation (VLN) in complex outdoor environments. To this end, the authors propose an end-to-end navigation framework that integrates a chain-of-thought (CoT) reasoning mechanism. The model jointly maps first-person visual observations and natural language instructions into continuous navigation actions through a two-stage training strategy: supervised fine-tuning followed by reinforcement fine-tuning. The key contributions are the first adaptation of chain-of-thought reasoning to UAV-based VLN, which improves decision interpretability, and the creation of the first outdoor UAV-VLN dataset tailored to urban architectural settings. Experimental results demonstrate that the proposed method significantly outperforms baseline approaches in unseen test environments, improving both the robustness and the execution efficiency of UAV navigation in complex outdoor scenarios.
📝 Abstract
Vision-Language Navigation aims to enable agents to understand natural language instructions and carry out appropriate navigation actions in real-world environments. Most work focuses on indoor settings, with little research on complex outdoor scenes, and current UAV Vision-and-Language Navigation models typically act as black boxes without explicit reasoning. We introduce FreeFly-thinking, an end-to-end VLN framework that converts the UAV agent's egocentric images and language instructions into a sequence of actions, inspired by the urban-architecture environments proposed by OpenFly. We first construct a UAV dataset for the navigation task and then perform natural-language chain-of-thought reasoning over it. We adopt a two-stage training strategy: supervised fine-tuning followed by reinforcement fine-tuning. Experiments on unseen test environments demonstrate strong performance, showing the robustness and efficiency of our approach to UAV navigation.
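The two-stage strategy described above can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: a 1-D linear "policy" stands in for the UAV model, expert demonstrations stand in for the dataset, and a simple accept-if-better perturbation stands in for reinforcement fine-tuning. All names, values, and the reward function are illustrative assumptions.

```python
# Toy sketch of the two-stage recipe from the abstract (all details hypothetical):
#   Stage 1 (SFT): regress the policy onto expert (observation -> action) pairs.
#   Stage 2 (RFT): nudge the policy toward actions that score a higher task reward,
#   using a crude hill-climbing stand-in for a policy-gradient method.
import random

random.seed(0)

def policy_action(w, obs):
    """Linear policy: predicted continuous action for a scalar observation."""
    return w * obs

def sft_step(w, batch, lr=0.1):
    """Supervised fine-tuning: one gradient step on squared error vs expert actions."""
    grad = 0.0
    for obs, expert_action in batch:
        grad += 2.0 * (policy_action(w, obs) - expert_action) * obs
    return w - lr * grad / len(batch)

def rft_step(w, obs, reward_fn, lr=0.05, noise=0.1):
    """Reinforcement fine-tuning stand-in: try a perturbed policy and move toward
    it only if it improves the task reward."""
    trial_w = w + random.uniform(-noise, noise)
    if reward_fn(policy_action(trial_w, obs)) > reward_fn(policy_action(w, obs)):
        return w + lr * (trial_w - w) / noise
    return w

# Hypothetical expert demonstrations: the "correct" action doubles the observation.
demos = [(o, 2.0 * o) for o in (0.5, 1.0, 1.5, 2.0)]

w = 0.0
for _ in range(50):                  # Stage 1: SFT pulls w toward the expert (w* = 2)
    w = sft_step(w, demos)

reward = lambda a: -abs(a - 2.0)     # reward peaks when action(obs=1.0) == 2.0
for _ in range(200):                 # Stage 2: RFT refines around the SFT solution
    w = rft_step(w, 1.0, reward)

print(round(w, 2))
```

The point of the sketch is the ordering: supervised fine-tuning gets the policy near the expert behavior cheaply, and the reward-driven stage then refines it using task feedback rather than labels, which is the structure the abstract ascribes to FreeFly-thinking.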