🤖 AI Summary
This work addresses vision-language navigation for unmanned aerial vehicles (UAVs) in dynamic 3D environments, where existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors. The authors propose AerialVLA, a minimalist end-to-end vision–language–action framework that directly maps raw visual inputs and fuzzy linguistic instructions to continuous physical control signals, enabling autonomous flight and precise landing without external guidance. The method introduces a streamlined dual-view perception strategy, a coarse directional prompting mechanism derived solely from onboard sensors, and a unified control space that combines continuous three-degree-of-freedom (3-DoF) kinematic commands with an intrinsic landing signal, eliminating dependence on object detectors and dense oracle guidance. Evaluated on the TravelUAV benchmark, the approach achieves nearly three times the success rate of leading baselines in unseen scenarios, demonstrating strong generalization and effective autonomous navigation.
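To make the unified control space concrete, here is a minimal sketch of what such an action interface could look like. The axis parameterization (vx, vy, vz), the [0, 1] landing confidence, and the 0.5 threshold are all illustrative assumptions, not the paper's actual formulation:

```python
from dataclasses import dataclass

@dataclass
class UAVAction:
    """One step of policy output: a continuous 3-DoF kinematic command
    plus an intrinsic landing signal (no external detector needed)."""
    vx: float    # forward velocity (m/s), continuous
    vy: float    # lateral velocity (m/s), continuous
    vz: float    # vertical velocity (m/s), continuous
    land: float  # landing confidence in [0, 1], predicted by the policy


def should_land(action: UAVAction, threshold: float = 0.5) -> bool:
    # Landing is triggered when the policy's own signal crosses a
    # threshold, rather than by an auxiliary object detector
    # confirming the target below the UAV.
    return action.land >= threshold
```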
📝 Abstract
Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework that directly maps raw visual observations and fuzzy linguistic instructions to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving the cues essential for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we devise a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Finally, we formulate a unified control space that integrates continuous three-degree-of-freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios, achieving nearly three times the success rate of leading baselines and validating that a minimalist, autonomy-centric paradigm learns more robust visuomotor representations than complex modular systems.
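As a rough illustration of how a directional prompt might be derived solely from onboard sensors, the sketch below buckets the relative bearing between the UAV's compass heading and an approximate target bearing into eight fuzzy direction phrases. The bucketing scheme, phrasing, and function name are hypothetical, not the paper's actual mechanism:

```python
def coarse_directional_cue(heading_deg: float, target_bearing_deg: float) -> str:
    """Turn a compass heading plus a rough target bearing into a fuzzy,
    language-level directional prompt (no dense oracle waypoints)."""
    relative = (target_bearing_deg - heading_deg) % 360.0
    buckets = ["ahead", "front-right", "right", "back-right",
               "behind", "back-left", "left", "front-left"]
    # Shift by half a bucket (22.5 deg) so each phrase is centered
    # on its compass direction, then divide into 45-degree sectors.
    index = int(((relative + 22.5) % 360.0) // 45.0)
    return f"The target is roughly {buckets[index]}."

# Example: heading due north (0 deg), target bearing 60 deg -> "front-right"
print(coarse_directional_cue(0.0, 60.0))
```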