🤖 AI Summary
This study addresses the poor zero-shot generalization of end-to-end visual navigation policies in unseen environments. It is the first large-scale empirical analysis of how the quantity and geographic diversity of training data affect map-free point-goal navigation. Using a crowdsourced video dataset of 4,565 hours spanning 161 locations in 35 countries, the authors train navigation policies and evaluate their closed-loop control performance on sidewalk robots in four countries, covering 125 km of autonomous driving. The study reports three findings. First, geographic diversity matters far more than total data volume: doubling the number of distinct training locations reduces navigation error by approximately 15%, whereas the benefit of adding more data from existing locations saturates with very little data. Second, with noisy crowdsourced data, simple regression-based models outperform generative and sequence-based architectures. Third, large-scale, diverse training data enables zero-shot navigation in unknown environments, with performance approaching that of policies trained on environment-specific demonstrations.
📝 Abstract
Generalization of imitation-learned navigation policies to environments unseen in training remains a major challenge. We address this by conducting the first large-scale study of how data quantity and data diversity affect real-world generalization in end-to-end, map-free visual navigation. Using a curated 4,565-hour crowd-sourced dataset collected across 161 locations in 35 countries, we train policies for point-goal navigation and evaluate their closed-loop control performance on sidewalk robots operating in four countries, covering 125 km of autonomous driving. Our results show that large-scale training data enables zero-shot navigation in unknown environments, approaching the performance of policies trained with environment-specific demonstrations. Critically, we find that data diversity is far more important than data quantity. Doubling the number of geographical locations in a training set decreases navigation errors by ~15%, while the performance benefit from adding data from existing locations saturates with very little data. We also observe that, with noisy crowd-sourced data, simple regression-based models outperform generative and sequence-based architectures. We release our policies, evaluation setup, and example videos on the project page.
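To make the "~15% per doubling" figure concrete, the sketch below turns it into a relative error curve. It rests on an assumption that the abstract does not state: that the same ~15% reduction applies at every doubling, which corresponds to a power law with exponent log2(0.85) ≈ -0.23. The function name `relative_error` and the baseline of 10 locations are hypothetical, chosen only for illustration, and the printed values are extrapolations, not results from the paper.

```python
import math


def relative_error(num_locations: int, base_locations: int,
                   reduction_per_doubling: float = 0.15) -> float:
    """Navigation error relative to a baseline trained on `base_locations`.

    ASSUMPTION (not from the paper): the reduction compounds, so every
    doubling of the location count multiplies the error by
    (1 - reduction_per_doubling).
    """
    if num_locations <= 0 or base_locations <= 0:
        raise ValueError("location counts must be positive")
    # Number of doublings between the baseline and the target location count.
    doublings = math.log2(num_locations / base_locations)
    return (1.0 - reduction_per_doubling) ** doublings


# Hypothetical baseline of 10 locations; each row doubles the location count.
for n in (10, 20, 40, 80, 160):
    print(f"{n:>4} locations -> relative error {relative_error(n, 10):.3f}")
```

Under this assumption, each row's error is 0.85 times that of the row above it. Whether the trend actually holds across the full range of 161 locations is an empirical question for the paper's own results, not something this sketch can confirm.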