CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of embodied navigation—e.g., for last-mile delivery robots—in dynamic, map-free, and off-street urban environments, this paper proposes the first fully web-scale, video-driven imitation learning framework that requires no manual annotation. Leveraging thousands of hours of urban walking and driving videos crawled from the web, the method introduces a video action parsing and spatiotemporal feature modeling pipeline that automatically extracts action supervision directly from raw video. The authors further develop an end-to-end, general-purpose urban navigation policy and improve its adaptability through multi-scenario generalization training. On a challenging urban navigation benchmark, the approach significantly outperforms existing methods, showing greater reliability and robustness on critical tasks including pedestrian avoidance, intersection decision-making, and curb detection.

📝 Abstract
Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. Project homepage is at https://ai4ce.github.io/CityWalker/.
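The abstract's key idea—extracting action supervision from raw video without annotation—can be sketched as follows. This is an illustrative assumption, not the paper's actual pipeline: it supposes camera poses have been recovered per frame (e.g., via visual odometry, a common choice for such pipelines) and converts future positions into each frame's local coordinates, yielding waypoint targets for imitation learning. The function name and input format are hypothetical.

```python
import math

def waypoint_labels(poses, horizon=5):
    """Derive waypoint action labels from a pose trajectory.

    poses: list of (x, y, yaw) world-frame poses, one per video frame
    (assumed to come from an odometry step; hypothetical input format).
    Returns, for each frame t, the next `horizon` positions expressed
    in frame t's local coordinates -- usable as imitation targets.
    """
    labels = []
    for t in range(len(poses) - horizon):
        x0, y0, yaw0 = poses[t]
        c, s = math.cos(-yaw0), math.sin(-yaw0)
        local = []
        for k in range(1, horizon + 1):
            dx = poses[t + k][0] - x0
            dy = poses[t + k][1] - y0
            # rotate the world-frame offset into the current heading frame
            local.append((c * dx - s * dy, s * dx + c * dy))
        labels.append(local)
    return labels

# Toy trajectory: walking straight along +x at 1 unit per frame, heading 0.
traj = [(float(i), 0.0, 0.0) for i in range(8)]
labels = waypoint_labels(traj, horizon=3)
# labels[0] is the next three positions in frame 0's local coordinates
```

Because the labels come for free from the video itself, supervision scales with the amount of footage collected rather than with annotation effort—the property the abstract emphasizes.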
Problem

Research questions and friction points this paper is trying to address.

Learning human-like urban navigation from web videos
Overcoming map-free and off-street navigation challenges
Developing scalable imitation learning without costly annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training agents on web-scale city videos
Scalable data processing without annotations
Large-scale imitation learning for navigation
👥 Authors
Xinhao Liu — New York University
Jintong Li — New York University
Yichen Jiang — Apple AI/ML (NLP, AI, Machine Learning)
Niranjan Sujay — New York University
Zhicheng Yang — New York University
Juexiao Zhang — CS PhD student at New York University (Machine Learning, Computer Vision, Robotics)
John Abanes — New York University
Jing Zhang — New York University
Chen Feng — New York University