🤖 AI Summary
Vision-language navigation (VLN) for unmanned aerial vehicles (UAVs) suffers from poor performance in long-trajectory, high-mobility outdoor scenarios and relies heavily on human intervention and fine-grained natural-language instructions.
Method: We propose DuAl-VLN, a novel dual-altitude collaborative VLN task in which high-altitude and low-altitude UAVs specialize in global perception and precise execution, respectively, coordinating efficiently via minimal coordinate exchange. To support this paradigm, we introduce HaL-13k, the first large-scale outdoor dual-altitude VLN dataset, and design Pilot-LLM, a multimodal large language model for cross-view target reasoning, coupled with a lightweight multi-stage policy model for robust low-altitude navigation.
Results: Experiments on HaL-13k demonstrate substantial improvements in long-trajectory navigation success rate, strong generalization to unseen environments, and a significant reduction in dependence on detailed instructions and human supervision.
📄 Abstract
Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs' high mobility, which can provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support the training and evaluation of DuAl-VLN, we construct HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. The dataset includes both an unseen-map and an unseen-object validation set to systematically evaluate a model's generalization across novel environments and unfamiliar targets. To consolidate the two UAVs' complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, in which the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and exchange only minimal coordinate information to ensure efficiency.
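The coordination pattern described above can be sketched schematically: the high-altitude agent reasons over a broad view and sends only a single target coordinate, and the low-altitude agent navigates toward it. This is a minimal illustrative sketch, not the paper's implementation; all class and method names are hypothetical, the Pilot-LLM reasoner is replaced by a landmark lookup, and the multi-stage policy by a simple clamped-step controller.

```python
from dataclasses import dataclass


@dataclass
class Coordinate:
    x: float
    y: float


class HighAltitudeUAV:
    # Stand-in for Pilot-LLM: the paper uses a multimodal LLM to reason
    # from the instruction and a wide-area view to a target location.
    # Here a landmark table keeps the sketch self-contained.
    def reason_target(self, instruction: str, landmarks: dict) -> Coordinate:
        x, y = landmarks[instruction]
        return Coordinate(x, y)


class LowAltitudeUAV:
    # Stand-in for the lightweight multi-stage policy: step toward the
    # received coordinate until within a grounding tolerance.
    def __init__(self, position: Coordinate, step: float = 1.0):
        self.position = position
        self.step = step

    def navigate_to(self, target: Coordinate, tol: float = 0.25) -> Coordinate:
        def clamp(d: float) -> float:
            return max(-self.step, min(self.step, d))

        while (abs(target.x - self.position.x) > tol
               or abs(target.y - self.position.y) > tol):
            self.position = Coordinate(
                self.position.x + clamp(target.x - self.position.x),
                self.position.y + clamp(target.y - self.position.y),
            )
        return self.position


# The only information crossing between agents is one Coordinate,
# mirroring the "minimal coordinate exchange" described in the abstract.
high = HighAltitudeUAV()
target = high.reason_target("red rooftop", {"red rooftop": (5.0, 3.0)})
low = LowAltitudeUAV(Coordinate(0.0, 0.0))
final = low.navigate_to(target)
```

The point of the sketch is the interface, not the controllers: keeping the inter-UAV message down to a coordinate (rather than images or full trajectories) is what makes the collaboration cheap to communicate.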