AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation

📅 2025-08-21
🤖 AI Summary
Vision-and-language navigation (VLN) for unmanned aerial vehicles (UAVs) suffers from poor performance in long-trajectory, high-mobility outdoor scenarios and relies heavily on human intervention and fine-grained natural-language instructions. Method: The authors propose DuAl-VLN, a novel dual-altitude collaborative VLN task in which high-altitude and low-altitude UAVs specialize in global perception and precise execution, respectively, coordinating efficiently via minimal coordinate exchange. To support this paradigm, they introduce HaL-13k, the first large-scale outdoor dual-altitude VLN dataset, and propose AeroDuo, a collaborative framework that pairs Pilot-LLM, a multimodal large language model for cross-view target reasoning, with a lightweight multi-stage policy model for robust low-altitude navigation. Results: Experiments on HaL-13k demonstrate substantial improvements in long-trajectory navigation success rate, strong generalization to unseen environments, and a marked reduction in dependence on detailed instructions and human supervision.

📝 Abstract
Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs' high mobility, which can provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support the training and evaluation of DuAl-VLN, we construct HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. This dataset includes both unseen-map and unseen-object validation sets to systematically evaluate the model's generalization across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and exchange only minimal coordinate information to ensure efficiency.
Problem

Research questions and friction points this paper is trying to address.

Enabling UAVs to navigate using natural language and vision
Addressing complex UAV maneuverability with extended trajectories
Developing collaborative dual-altitude UAVs for improved navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-UAV collaborative VLN framework
High-altitude UAV uses multimodal LLM
Low-altitude UAV employs lightweight policy
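The division of labor above can be sketched as a simple control loop. This is a hypothetical illustration, not the paper's implementation: `high_altitude_propose` stands in for Pilot-LLM's target reasoning and `low_altitude_step` for the lightweight multi-stage policy, and only a single target coordinate passes between the two agents, mirroring the minimal coordinate exchange described in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:
    x: float
    y: float


def high_altitude_propose(instruction: str, global_view: dict) -> Coord:
    """Stand-in for Pilot-LLM: from the high-altitude view, pick the
    landmark mentioned in the instruction and return its coordinate."""
    for name, coord in global_view.items():
        if name in instruction:
            return coord
    raise ValueError("no landmark from the instruction is visible")


def low_altitude_step(pos: Coord, target: Coord, speed: float = 1.0) -> Coord:
    """Stand-in for the low-altitude policy: move one step toward the
    coordinate shared by the high-altitude UAV."""
    dx, dy = target.x - pos.x, target.y - pos.y
    dist = (dx * dx + dy * dy) ** 0.5
    if dist <= speed:
        return target
    return Coord(pos.x + speed * dx / dist, pos.y + speed * dy / dist)


def navigate(instruction: str, global_view: dict, start: Coord,
             max_steps: int = 100) -> Coord:
    # The only message exchanged between the two UAVs is the target
    # coordinate; the low-altitude UAV then navigates on its own.
    target = high_altitude_propose(instruction, global_view)
    pos = start
    for _ in range(max_steps):
        if pos == target:
            break
        pos = low_altitude_step(pos, target)
    return pos
```

In the actual system the high-altitude proposal comes from multimodal reasoning over aerial imagery and the low-altitude step from a learned policy with target grounding; the sketch only shows the communication pattern.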