CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information

📅 2024-06-20
🏛️ arXiv.org
📈 Citations: 23
✨ Influential: 4
🤖 AI Summary
Existing vision-language navigation (VLN) research focuses predominantly on indoor and ground-level outdoor scenes; city-scale language-guided aerial navigation remains underexplored due to the absence of large-scale datasets and geo-visual fusion frameworks. This paper introduces CityNav, the first real-world, city-scale language-guided aerial navigation dataset, comprising 32K natural-language instructions and corresponding human-demonstrated trajectories, constructed via a web-based 3D simulation platform and annotated with real-world landmark names and geographic coordinates. We establish the first city-scale vision-language aerial navigation benchmark and propose a navigation agent architecture incorporating 2D spatial maps. Behavioral cloning experiments reveal that human demonstration supervision substantially outperforms shortest-path supervision, and explicit spatial map representation significantly enhances navigation performance. Empirical results expose a substantial performance gap between human and agent navigation, underscoring the task's complexity and its scientific significance.

πŸ“ Abstract
Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. Despite notable advancements in ground-level navigation, the exploration of aerial navigation using these modalities remains limited. This gap primarily arises from a lack of suitable resources for real-world, city-scale aerial navigation studies. To remedy this gap, we introduce CityNav, a novel dataset explicitly designed for language-guided aerial navigation in photorealistic 3D environments of real cities. CityNav comprises 32k natural language descriptions paired with human demonstration trajectories, collected via a newly developed web-based 3D simulator. Each description identifies a navigation goal, utilizing the names and locations of landmarks within actual cities. As an initial step toward addressing this challenge, we provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We have benchmarked the latest aerial navigation methods alongside our proposed baseline model on the CityNav dataset. The findings are revealing: (i) our aerial agent model trained on human demonstration trajectories outperforms those trained on shortest path trajectories by a large margin; (ii) incorporating 2D spatial map information markedly and robustly enhances navigation performance at a city scale; (iii) despite the use of map information, our challenging CityNav dataset reveals a persistent performance gap between our baseline models and human performance. To foster further research in aerial VLN, we have made the dataset and code available at https://water-cookie.github.io/city-nav-proj/
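The abstract's key architectural idea is feeding the agent an internal 2D spatial map of landmarks named in the instruction. As a rough illustration of what such a map input might look like, the sketch below rasterizes landmark geographic coordinates into a per-landmark 2D grid. All function names, parameters, and coordinate conventions here are illustrative assumptions, not CityNav's actual data format or API.

```python
import numpy as np

def build_landmark_map(landmarks, area_origin, cell_size_m=5.0, grid_shape=(200, 200)):
    """Rasterize named landmarks into a one-channel-per-landmark 2D grid.

    landmarks: dict mapping landmark name -> (x_m, y_m) metric coordinates.
    area_origin: (x_m, y_m) of the grid's lower-left corner.
    Note: this is a hypothetical sketch, not the paper's implementation.
    """
    grid = np.zeros((len(landmarks), *grid_shape), dtype=np.float32)
    for channel, (name, (x_m, y_m)) in enumerate(landmarks.items()):
        # Convert metric offsets from the area origin into grid indices.
        col = int((x_m - area_origin[0]) / cell_size_m)
        row = int((y_m - area_origin[1]) / cell_size_m)
        if 0 <= row < grid_shape[0] and 0 <= col < grid_shape[1]:
            grid[channel, row, col] = 1.0
    return grid

# Example with two made-up landmarks, positions in meters from the origin.
landmarks = {"city_hall": (120.0, 340.0), "station": (480.0, 95.0)}
semantic_map = build_landmark_map(landmarks, area_origin=(0.0, 0.0))
print(semantic_map.shape)  # (2, 200, 200)
```

A grid like this can be stacked with visual features and consumed by a convolutional policy, which is one plausible reading of how explicit map channels could "markedly and robustly" help at city scale: the goal landmark's location becomes directly observable rather than something the agent must infer from pixels alone.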
Problem

Research questions and friction points this paper is trying to address.

Addressing aerial navigation challenges in real-world cities
Integrating visual and geographic information for VLN
Overcoming dataset limitations for aerial VLN benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale real-world aerial VLN dataset
Geographic semantic maps as auxiliary input
Integration of visual and geographic information
Jungdae Lee, Tokyo Institute of Technology
Taiki Miyanishi, The University of Tokyo
Shuhei Kurita, National Institute of Informatics
Koya Sakamoto, ATR, Kyoto University
Daichi Azuma, Sony Semiconductor Solutions
Yutaka Matsuo, The University of Tokyo
Nakamasa Inoue, Tokyo Institute of Technology