CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of embodied navigation—e.g., for last-mile delivery robots—in dynamic, map-free, and off-street urban environments, this paper proposes the first fully web-scale, video-driven imitation learning framework that requires no manual annotation. Leveraging thousands of hours of urban walking and driving videos crawled from the web, the method introduces a video action parsing and spatiotemporal feature modeling pipeline that automatically extracts action supervision directly from raw video. The authors further develop an end-to-end, general-purpose urban navigation policy and improve its adaptability through multi-scenario generalization training. On a challenging urban navigation benchmark, the approach significantly outperforms existing methods, showing greater reliability and robustness on critical tasks including pedestrian avoidance, intersection decision-making, and curb detection.

📝 Abstract
Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. Project homepage is at https://ai4ce.github.io/CityWalker/.
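The abstract's key idea—extracting action supervision from raw video without annotation—can be sketched as follows. This is an illustrative assumption, not the paper's actual pipeline: it supposes camera poses have been recovered per frame (e.g., via visual odometry, a common choice for such pipelines) and converts future positions into each frame's local coordinates, yielding waypoint targets for imitation learning. The function name and input format are hypothetical.

```python
import math

def waypoint_labels(poses, horizon=5):
    """Derive waypoint action labels from a pose trajectory.

    poses: list of (x, y, yaw) world-frame poses, one per video frame
    (assumed to come from an odometry step; hypothetical input format).
    Returns, for each frame t, the next `horizon` positions expressed
    in frame t's local coordinates -- usable as imitation targets.
    """
    labels = []
    for t in range(len(poses) - horizon):
        x0, y0, yaw0 = poses[t]
        c, s = math.cos(-yaw0), math.sin(-yaw0)
        local = []
        for k in range(1, horizon + 1):
            dx = poses[t + k][0] - x0
            dy = poses[t + k][1] - y0
            # rotate the world-frame offset into the current heading frame
            local.append((c * dx - s * dy, s * dx + c * dy))
        labels.append(local)
    return labels

# Toy trajectory: walking straight along +x at 1 unit per frame, heading 0.
traj = [(float(i), 0.0, 0.0) for i in range(8)]
labels = waypoint_labels(traj, horizon=3)
# labels[0] is the next three positions in frame 0's local coordinates
```

Because the labels come for free from the video itself, supervision scales with the amount of footage collected rather than with annotation effort—the property the abstract emphasizes.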
Problem

Research questions and friction points this paper is trying to address.

Learning human-like urban navigation from web videos
Overcoming map-free and off-street navigation challenges
Developing scalable imitation learning without costly annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training agents on web-scale city videos
Scalable data processing without annotations
Large-scale imitation learning for navigation
👥 Authors
Xinhao Liu — New York University
Jintong Li — New York University
Yichen Jiang — Apple AI/ML (NLP, AI, Machine Learning)
Niranjan Sujay — New York University
Zhicheng Yang — New York University
Juexiao Zhang — CS PhD student at New York University (Machine Learning, Computer Vision, Robotics)
John Abanes — New York University
Jing Zhang — New York University
Chen Feng — New York University