History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

📅 2025-12-16
🤖 AI Summary
In large-scale urban environments, single-granularity vision-and-language navigation (VLN) methods struggle to support both global environmental reasoning and fine-grained local scene understanding when a UAV follows natural-language instructions. To address this, we propose a two-stage VLN framework that combines coarse-grained target localization with fine-grained action decision-making. Our key innovation is a history-enhanced two-stage Transformer that uses a dynamically updated historical grid map as structured spatial memory, together with spatial landmark encoding, historical context modeling, and multi-scale visual feature alignment. Evaluated on the CityNav dataset with manually refined annotations, our method achieves significant improvements in navigation success rate and path efficiency. Ablation studies confirm the effectiveness of each component, and overall performance surpasses state-of-the-art single-granularity approaches.

📝 Abstract
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. A historical grid map is also designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Finally, the CityNav dataset annotations are manually refined to improve data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
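The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not HETT itself. The score grid stands in for whatever the first-stage Transformer would output, and the second stage is replaced by a purely geometric rule rather than fine-grained visual analysis. The function names, the action set, the thresholds, and the coordinate convention (x to the east, y to the north, headings in degrees counter-clockwise from the x-axis) are all assumptions made for illustration.

```python
import math


def coarse_target(scores, cell_size):
    """Stage 1 stand-in: return the centre of the highest-scoring grid cell.

    `scores` is a hypothetical 2D list of per-cell target likelihoods; in a
    real system it would come from a learned model, not be given by hand.
    """
    cells = ((r, c) for r in range(len(scores)) for c in range(len(scores[0])))
    row, col = max(cells, key=lambda rc: scores[rc[0]][rc[1]])
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)


def fine_action(position, heading_deg, target, stop_radius=5.0, turn_tol_deg=15.0):
    """Stage 2 stand-in: pick one discrete action that moves toward `target`.

    A geometric rule replaces the visual refinement described in the
    abstract; the four action names are assumed, not taken from the paper.
    """
    dx = target[0] - position[0]
    dy = target[1] - position[1]
    if math.hypot(dx, dy) <= stop_radius:
        return "stop"
    bearing = math.degrees(math.atan2(dy, dx))
    # Signed heading error wrapped into [-180, 180).
    error = (bearing - heading_deg + 180.0) % 360.0 - 180.0
    if error > turn_tol_deg:
        return "turn_left"
    if error < -turn_tol_deg:
        return "turn_right"
    return "forward"
```

Separating the two functions mirrors the split the abstract describes: the coarse stage only decides where to go, and the fine stage only decides the next action given that goal, so either can be swapped for a learned component without touching the other.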
Problem

Research questions and friction points this paper is trying to address.

Balances global reasoning and local comprehension in UAV navigation
Integrates coarse-to-fine pipeline for target localization from instructions
Enhances spatial memory and data quality for improved navigation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage transformer with coarse-to-fine navigation pipeline
Historical grid map for structured spatial memory
Manual refinement of dataset annotations for quality
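To make the "historical grid map" idea concrete, here is a minimal sketch of a grid-shaped memory that folds each new feature vector into the cell the agent currently occupies. The class name, the running-mean aggregation rule, the cell size in metres, and the clamping at the map border are all assumptions for illustration; the page does not say how the paper aggregates features.

```python
from dataclasses import dataclass, field


@dataclass
class HistoricalGridMap:
    """Toy spatial memory: a running mean of feature vectors per grid cell.

    Hypothetical illustration only; the aggregation rule and all parameters
    are assumptions, not details taken from the paper.
    """
    rows: int
    cols: int
    cell_size: float  # side length of one cell, in metres (assumed unit)
    feat_dim: int
    feats: list = field(init=False)
    counts: list = field(init=False)

    def __post_init__(self):
        self.feats = [[[0.0] * self.feat_dim for _ in range(self.cols)]
                      for _ in range(self.rows)]
        self.counts = [[0] * self.cols for _ in range(self.rows)]

    def cell_of(self, x, y):
        """Map a world position to a (row, col) index, clamped to the map."""
        col = min(max(int(x // self.cell_size), 0), self.cols - 1)
        row = min(max(int(y // self.cell_size), 0), self.rows - 1)
        return row, col

    def update(self, x, y, feature):
        """Fold one observation into its cell using an incremental mean."""
        if len(feature) != self.feat_dim:
            raise ValueError("feature has the wrong dimensionality")
        row, col = self.cell_of(x, y)
        self.counts[row][col] += 1
        n = self.counts[row][col]
        cell = self.feats[row][col]
        for i, value in enumerate(feature):
            cell[i] += (value - cell[i]) / n

    def visited(self):
        """Return the cells that hold at least one observation, row-major."""
        return [(r, c) for r in range(self.rows) for c in range(self.cols)
                if self.counts[r][c] > 0]
```

Calling `update` once per step keeps the memory size fixed regardless of trajectory length, which is the main appeal of a grid over an unbounded list of past observations. A learned model would likely replace the running mean with a trainable fusion step.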
Authors

Xichen Ding
Ant Group
machine learning, nlp, recommendation

Jianzhe Gao
The State Key Lab of Brain-Machine Intelligence, Zhejiang University

Cong Pan
College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education, China

Wenguan Wang
Zhejiang University
Neural-Symbolic AI, Embodied AI, Autonomous Cars, Computer Vision, Artificial Intelligence

Jie Qin
Professor, Nanjing University of Aeronautics and Astronautics
Computer Vision, Machine Learning, Pattern Recognition