CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Urban aerial vision-language navigation (VLN) faces key challenges: absence of predefined maps, exponential growth of the long-horizon action space, and difficulty in grounding natural language instructions. To address these, we propose the first hierarchical semantic planning framework tailored for unmanned aerial vehicles (UAVs). Our method introduces a Hierarchical Semantic Planning Module (HSPM) for multi-granularity goal decomposition and a structured trajectory reuse mechanism grounded in global topological memory. We further integrate large language model (LLM)-driven semantic parsing with end-to-end continuous control policy learning. Evaluated in a photorealistic urban simulation environment, our approach achieves state-of-the-art (SOTA) performance—improving long-horizon navigation success rate by +23.6% and significantly enhancing cross-scene generalization. The code is publicly available.

📝 Abstract
Aerial vision-and-language navigation (VLN), which requires drones to interpret natural language instructions and navigate complex urban environments, is emerging as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. Although existing ground VLN agents have achieved notable results in indoor and outdoor settings, they struggle in aerial VLN due to the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. In this work, we propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity of urban aerial VLN. Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals at different semantic levels. The agent reaches the target progressively by achieving sub-goals with different capabilities of the LLM. Additionally, a global memory module that stores historical trajectories in a topological graph is developed to simplify navigation to previously visited targets. Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvement. Further experiments demonstrate the effectiveness of the different modules of CityNavAgent for aerial VLN in continuous city environments. The code is available at https://github.com/VinceOuti/CityNavAgent.
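The abstract describes HSPM decomposing a long-horizon instruction into sub-goals at different semantic levels. The paper's actual LLM-driven parser is not reproduced here; the following is a minimal hypothetical sketch (all names, e.g. `SubGoal` and `decompose_instruction`, are illustrative, not from the released code) of how a landmark-level decomposition could look, with an ordered-keyword matcher standing in for the LLM:

```python
from dataclasses import dataclass

@dataclass
class SubGoal:
    level: str        # e.g. "landmark", "object", "motion"
    description: str

def decompose_instruction(instruction: str, landmarks: list[str]) -> list[SubGoal]:
    """Toy stand-in for the LLM-driven semantic parser: extract the
    landmarks mentioned in a long-horizon instruction, ordered by
    where they appear, as landmark-level sub-goals."""
    ordered = sorted(
        (lm for lm in landmarks if lm in instruction),
        key=instruction.index,
    )
    return [SubGoal("landmark", lm) for lm in ordered]

instr = "fly past the red tower, then over the bridge, and land at the plaza"
goals = decompose_instruction(instr, ["bridge", "plaza", "red tower"])
print([g.description for g in goals])  # ['red tower', 'bridge', 'plaza']
```

Each landmark-level sub-goal would then be handed to lower semantic levels (object grounding, motion control) in turn, which is how the agent "reaches the target progressively."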
Problem

Research questions and friction points this paper is trying to address.

Aerial VLN for drones in complex urban environments
Hierarchical semantic planning for long-horizon navigation
Global memory module to simplify navigation to previously visited targets
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-empowered agent reduces aerial navigation complexity
Hierarchical semantic planning decomposes long-horizon tasks
Global memory module stores historical trajectories
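The global memory module stores historical trajectories in a topological graph so that revisits can reuse known routes. As a hypothetical sketch only (the class name, cost model, and API below are assumptions, not the paper's implementation), visited waypoints can be kept as graph nodes, traversed segments as weighted edges, and a stored shortest path returned when the target was seen before:

```python
import heapq
from collections import defaultdict

class TopologicalMemory:
    """Minimal sketch of a trajectory-backed topological memory:
    nodes are visited waypoints, edges are traversed segments."""

    def __init__(self):
        self.edges = defaultdict(dict)  # node -> {neighbor: traversal cost}

    def add_trajectory(self, waypoints, costs):
        """Record a flown trajectory, keeping the cheapest known cost
        for each segment (edges are treated as bidirectional)."""
        for (a, b), c in zip(zip(waypoints, waypoints[1:]), costs):
            self.edges[a][b] = min(self.edges[a].get(b, float("inf")), c)
            self.edges[b][a] = min(self.edges[b].get(a, float("inf")), c)

    def shortest_path(self, start, goal):
        """Dijkstra over the stored graph; None if goal was never visited."""
        dist, prev = {start: 0.0}, {}
        heap = [(0.0, start)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == goal:
                break
            if d > dist.get(u, float("inf")):
                continue
            for v, c in self.edges[u].items():
                nd = d + c
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        if goal not in dist:
            return None
        path, node = [goal], goal
        while node != start:
            node = prev[node]
            path.append(node)
        return path[::-1]

mem = TopologicalMemory()
mem.add_trajectory(["pad", "tower", "plaza"], [12.0, 8.0])
mem.add_trajectory(["pad", "bridge", "plaza"], [5.0, 30.0])
print(mem.shortest_path("pad", "plaza"))  # ['pad', 'tower', 'plaza']
```

Looking up a previously visited target this way sidesteps fresh long-horizon exploration, which is the stated purpose of the memory module.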
Weichen Zhang — PhD, University of Sydney (Computer Vision, Deep Learning, Transfer Learning, Domain Adaptation)
Chen Gao — Tsinghua University
Shiquan Yu — Tsinghua University
Ruiying Peng — Tsinghua University
Baining Zhao — Tsinghua University
Qian Zhang — Tsinghua University
Jinqiang Cui — PCL (LLM/VLM + Multi-robot systems)
Xinlei Chen — Tsinghua University
Yong Li — Tsinghua University