🤖 AI Summary
In large-scale urban environments, single-granularity vision-language navigation (VLN) methods struggle to support both global environmental reasoning and fine-grained local scene understanding when UAVs follow natural language instructions. To address this, we propose a two-stage VLN framework that combines coarse-grained global localization with fine-grained action decision-making. The key innovation is a history-enhanced two-stage Transformer architecture that uses a dynamically updated historical grid map as structured spatial memory, together with spatial landmark encoding, historical context modeling, and multi-scale visual feature alignment. Evaluated on the CityNav dataset with manually refined annotations, the method achieves significant improvements in navigation success rate and path efficiency. Ablation studies confirm the effectiveness of each component, and overall performance surpasses state-of-the-art single-granularity approaches.
📝 Abstract
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. A historical grid map is also designed to dynamically aggregate visual features into a structured spatial memory, giving the agent more comprehensive scene awareness. Finally, the CityNav dataset annotations are manually refined to improve data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
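
To make the idea of a historical grid map more concrete, the following is a minimal sketch of a grid-shaped memory that aggregates per-step visual features by cell. It is not the HETT implementation: the class name `HistoryGridMap`, the running-mean aggregation rule, the cell size in metres, and the plain-list feature vectors are all assumptions chosen for illustration, since the abstract does not specify how features are aggregated.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryGridMap:
    """Hypothetical grid memory: one running-mean feature vector per cell."""

    rows: int
    cols: int
    cell_size: float  # metres per cell (assumed unit)
    feat_dim: int
    feats: dict = field(default_factory=dict)   # (row, col) -> mean feature
    counts: dict = field(default_factory=dict)  # (row, col) -> observation count

    def cell_of(self, x: float, y: float) -> tuple:
        """Map a world position to a grid index, clamped to the map bounds."""
        row = min(max(int(y // self.cell_size), 0), self.rows - 1)
        col = min(max(int(x // self.cell_size), 0), self.cols - 1)
        return row, col

    def update(self, x: float, y: float, feature: list) -> None:
        """Fold a new visual feature into the cell the agent is observing.

        Assumption: features are merged with an incremental mean; the paper
        may use a different (e.g. learned) aggregation.
        """
        if len(feature) != self.feat_dim:
            raise ValueError("feature has the wrong dimension")
        key = self.cell_of(x, y)
        n = self.counts.get(key, 0)
        old = self.feats.get(key, [0.0] * self.feat_dim)
        self.feats[key] = [(o * n + f) / (n + 1) for o, f in zip(old, feature)]
        self.counts[key] = n + 1

    def read(self, row: int, col: int) -> list:
        """Return the stored feature for a cell, or zeros if never observed."""
        return self.feats.get((row, col), [0.0] * self.feat_dim)
```

In this sketch, each call to `update` corresponds to one navigation step, so cells the UAV has observed repeatedly hold an averaged feature while unvisited cells read back as zeros. A downstream model could then consume the grid as a structured, spatially indexed history rather than as a flat sequence of past frames.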