CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information

📅 2024-06-20
🏛️ arXiv.org
📈 Citations: 23
✨ Influential: 4
🤖 AI Summary
Existing vision-language navigation (VLN) research focuses predominantly on indoor and ground-level outdoor scenes; city-scale language-guided aerial navigation remains underexplored due to the absence of large-scale datasets and geo-visual fusion frameworks. This paper introduces CityNav, the first real-world, city-scale language-guided aerial navigation dataset, comprising 32K natural-language instructions and corresponding human-demonstrated trajectories, constructed via a web-based 3D simulation platform and annotated with real-world landmark names and geographic coordinates. We establish the first city-scale vision-language aerial navigation benchmark and propose a navigation agent architecture incorporating 2D spatial maps. Behavioral cloning experiments reveal that human demonstration supervision substantially outperforms shortest-path supervision, and explicit spatial map representation significantly enhances navigation performance. Empirical results expose a substantial performance gap between human and agent navigation, underscoring the task's complexity and its scientific significance.

πŸ“ Abstract
Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. Despite notable advancements in ground-level navigation, the exploration of aerial navigation using these modalities remains limited. This gap primarily arises from a lack of suitable resources for real-world, city-scale aerial navigation studies. To remedy this gap, we introduce CityNav, a novel dataset explicitly designed for language-guided aerial navigation in photorealistic 3D environments of real cities. CityNav comprises 32k natural language descriptions paired with human demonstration trajectories, collected via a newly developed web-based 3D simulator. Each description identifies a navigation goal, utilizing the names and locations of landmarks within actual cities. As an initial step toward addressing this challenge, we provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We have benchmarked the latest aerial navigation methods alongside our proposed baseline model on the CityNav dataset. The findings are revealing: (i) our aerial agent model trained on human demonstration trajectories outperforms those trained on shortest path trajectories by a large margin; (ii) incorporating 2D spatial map information markedly and robustly enhances navigation performance at a city scale; (iii) despite the use of map information, our challenging CityNav dataset reveals a persistent performance gap between our baseline models and human performance. To foster further research in aerial VLN, we have made the dataset and code available at https://water-cookie.github.io/city-nav-proj/
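The abstract's key architectural idea is feeding the agent an internal 2D spatial map of landmarks named in the instruction. As a rough illustration of what such a map input might look like, the sketch below rasterizes landmark geographic coordinates into a per-landmark 2D grid. All function names, parameters, and coordinate conventions here are illustrative assumptions, not CityNav's actual data format or API.

```python
import numpy as np

def build_landmark_map(landmarks, area_origin, cell_size_m=5.0, grid_shape=(200, 200)):
    """Rasterize named landmarks into a one-channel-per-landmark 2D grid.

    landmarks: dict mapping landmark name -> (x_m, y_m) metric coordinates.
    area_origin: (x_m, y_m) of the grid's lower-left corner.
    Note: this is a hypothetical sketch, not the paper's implementation.
    """
    grid = np.zeros((len(landmarks), *grid_shape), dtype=np.float32)
    for channel, (name, (x_m, y_m)) in enumerate(landmarks.items()):
        # Convert metric offsets from the area origin into grid indices.
        col = int((x_m - area_origin[0]) / cell_size_m)
        row = int((y_m - area_origin[1]) / cell_size_m)
        if 0 <= row < grid_shape[0] and 0 <= col < grid_shape[1]:
            grid[channel, row, col] = 1.0
    return grid

# Example with two made-up landmarks, positions in meters from the origin.
landmarks = {"city_hall": (120.0, 340.0), "station": (480.0, 95.0)}
semantic_map = build_landmark_map(landmarks, area_origin=(0.0, 0.0))
print(semantic_map.shape)  # (2, 200, 200)
```

A grid like this can be stacked with visual features and consumed by a convolutional policy, which is one plausible reading of how explicit map channels could "markedly and robustly" help at city scale: the goal landmark's location becomes directly observable rather than something the agent must infer from pixels alone.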
Problem

Research questions and friction points this paper is trying to address.

Addressing aerial navigation challenges in real-world cities
Integrating visual and geographic information for VLN
Overcoming dataset limitations for aerial VLN benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale real-world aerial VLN dataset
Geographic semantic maps as auxiliary input
Integration of visual and geographic information
Jungdae Lee, Tokyo Institute of Technology
Taiki Miyanishi, The University of Tokyo
Shuhei Kurita, National Institute of Informatics
Koya Sakamoto, ATR, Kyoto University
Daichi Azuma, Sony Semiconductor Solutions
Yutaka Matsuo, The University of Tokyo
Nakamasa Inoue, Tokyo Institute of Technology