History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

📅 2025-12-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In large-scale urban environments, single-granularity vision-and-language navigation (VLN) methods struggle to support both global environmental reasoning and fine-grained local scene understanding when a UAV follows natural language instructions. To address this, the paper proposes the History-Enhanced Two-Stage Transformer (HETT), a two-stage framework that combines coarse-grained target localization with fine-grained action decision-making. The first stage predicts a coarse target position by fusing spatial landmarks with historical context; the second refines actions through fine-grained visual analysis. A dynamically updated historical grid map aggregates visual features into a structured spatial memory. On the CityNav dataset, whose annotations the authors manually refined, HETT is reported to deliver significant performance gains, and ablation studies support the contribution of each component.

📝 Abstract

Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Furthermore, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
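To make the idea of a historical grid map more concrete, the following is a minimal sketch of a grid-shaped spatial memory that aggregates per-observation feature vectors into cells. It is not HETT's implementation: the class name `HistoryGridMap`, the running-mean aggregation rule, the metre-based `cell_size`, and the zero vector for unvisited cells are all assumptions made for illustration, since the abstract does not specify how features are aggregated.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryGridMap:
    """Hypothetical grid memory: one running-mean feature vector per cell."""

    rows: int
    cols: int
    cell_size: float  # side length of a cell in world units (assumed metres)
    feat_dim: int
    sums: dict = field(default_factory=dict)    # (row, col) -> summed feature
    counts: dict = field(default_factory=dict)  # (row, col) -> number of updates

    def cell_of(self, x: float, y: float) -> tuple:
        """Map a world position to a cell index, clamped to the map bounds."""
        row = min(max(int(y // self.cell_size), 0), self.rows - 1)
        col = min(max(int(x // self.cell_size), 0), self.cols - 1)
        return row, col

    def update(self, x: float, y: float, feature: list) -> None:
        """Accumulate a visual feature observed at world position (x, y)."""
        if len(feature) != self.feat_dim:
            raise ValueError("feature has the wrong dimension")
        cell = self.cell_of(x, y)
        acc = self.sums.setdefault(cell, [0.0] * self.feat_dim)
        for i, value in enumerate(feature):
            acc[i] += value
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def read(self, row: int, col: int) -> list:
        """Return the mean feature of a cell, or zeros if it was never visited."""
        n = self.counts.get((row, col), 0)
        if n == 0:
            return [0.0] * self.feat_dim
        return [s / n for s in self.sums[(row, col)]]


grid = HistoryGridMap(rows=4, cols=4, cell_size=10.0, feat_dim=2)
grid.update(5.0, 5.0, [1.0, 3.0])
grid.update(7.0, 2.0, [3.0, 5.0])  # falls in the same cell as the first update
print(grid.read(0, 0))
```

A running mean keeps the memory size fixed regardless of trajectory length, which is one plausible reason to prefer a grid over a growing list of past observations; a learned aggregation would replace the mean in a trained model.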

Problem

Research questions and friction points this paper is trying to address.

Mono-granularity UAV agents struggle to balance global environmental reasoning with local scene comprehension

Localizing instruction-specified targets in large-scale urban environments calls for a coarse-to-fine pipeline

Spatial memory and dataset annotation quality limit navigation performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage transformer with coarse-to-fine navigation pipeline

Historical grid map for structured spatial memory

Manual refinement of dataset annotations for quality
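The coarse-to-fine control flow listed above can be sketched as a toy loop. This is an assumption-laden illustration, not the paper's model: `coarse_stage`, `fine_stage`, and `navigate` are hypothetical names, the coarse stage is replaced by an argmax over given cell scores, and the fine stage is replaced by a purely geometric rule rather than the fine-grained visual analysis the abstract describes.

```python
import math


def coarse_stage(cell_scores: dict) -> tuple:
    """Stage 1 stand-in: choose the highest-scoring grid cell as the coarse target."""
    return max(cell_scores, key=cell_scores.get)


def fine_stage(position: tuple, target_xy: tuple,
               stop_radius: float = 5.0, step: float = 5.0) -> tuple:
    """Stage 2 stand-in: pick one discrete action that moves toward the target."""
    dx = target_xy[0] - position[0]
    dy = target_xy[1] - position[1]
    if math.hypot(dx, dy) <= stop_radius:
        return "stop", position
    if abs(dx) >= abs(dy):
        action = "east" if dx > 0 else "west"
        return action, (position[0] + math.copysign(step, dx), position[1])
    action = "north" if dy > 0 else "south"
    return action, (position[0], position[1] + math.copysign(step, dy))


def navigate(start: tuple, cell_scores: dict,
             cell_size: float = 10.0, max_steps: int = 50) -> tuple:
    """Run the coarse stage once, then refine actions until 'stop' or the step cap."""
    row, col = coarse_stage(cell_scores)
    # Use the centre of the chosen cell as the coarse target position.
    target = ((col + 0.5) * cell_size, (row + 0.5) * cell_size)
    position, actions = start, []
    for _ in range(max_steps):
        action, position = fine_stage(position, target)
        actions.append(action)
        if action == "stop":
            break
    return actions, position


# Hypothetical scores for two cells; a trained model would produce these.
scores = {(0, 0): 0.1, (2, 1): 0.9}
actions, final_position = navigate((0.0, 0.0), scores)
print(actions, final_position)
```

Separating the two stages this way lets the coarse target be computed from global context while the per-step decision only needs local information, which mirrors the division of labor the summary attributes to the two-stage design.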
