AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing vision-language navigation systems for unmanned aerial vehicles rely on fine-grained instructions and struggle to achieve autonomous obstacle avoidance and path planning in unknown outdoor environments using only coarse language guidance. To this end, we propose AutoFly, an end-to-end vision-language-action model that incorporates a pseudo-depth encoder to enhance spatial perception and employs a two-stage progressive training strategy to effectively align visual, depth, linguistic, and action representations. Our approach establishes the first vision-language-action paradigm tailored for outdoor autonomous navigation and introduces a new dataset emphasizing continuous obstacle avoidance and autonomous decision-making. Experiments demonstrate that AutoFly outperforms the current state-of-the-art vision-language-action baseline by 3.9% in success rate and exhibits robust performance in both simulated and real-world environments.

📝 Abstract
Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their heavy reliance on explicit instruction-following over autonomous decision-making and insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate compared to state-of-the-art VLA baselines, with consistent performance across simulated and real environments.
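The core idea behind the pseudo-depth encoder — deriving depth-aware features from RGB alone and fusing them with the visual representation — can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the paper's implementation: the toy depth heuristic stands in for a learned monocular depth estimator, and the random projections stand in for trained encoders.

```python
import numpy as np

def toy_pseudo_depth(rgb):
    # Illustrative stand-in for a learned monocular depth estimator:
    # a toy heuristic that biases lower image rows toward "nearer".
    gray = rgb.mean(axis=-1)                           # (H, W) intensity
    rows = np.linspace(1.0, 0.2, gray.shape[0])[:, None]
    return gray * rows                                 # pseudo-depth map (H, W)

def encode(x, out_dim, seed):
    # Toy "encoder": a fixed random linear projection of the flattened input,
    # standing in for a trained visual or depth backbone.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.size, out_dim)) / np.sqrt(x.size)
    return x.reshape(-1) @ w                           # (out_dim,)

def fuse_rgb_and_pseudo_depth(rgb, out_dim=8):
    # Derive a pseudo-depth map from RGB, encode both modalities,
    # and concatenate into a single depth-aware feature vector.
    depth = toy_pseudo_depth(rgb)
    f_rgb = encode(rgb, out_dim, seed=0)
    f_depth = encode(depth, out_dim, seed=1)
    return np.concatenate([f_rgb, f_depth])

rgb = np.random.default_rng(42).random((4, 4, 3))      # tiny fake RGB frame
feat = fuse_rgb_and_pseudo_depth(rgb)
print(feat.shape)  # (16,)
```

In the full model this fused feature would be aligned with linguistic and action representations during the two-stage training; the sketch only shows why no separate depth sensor is needed at inference time.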
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
UAV Autonomous Navigation
Autonomous Decision-making
Obstacle Avoidance
Unknown Environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
Pseudo-Depth Encoder
Autonomous UAV Navigation
Two-Stage Training
Real-World Navigation Dataset