AI Summary
To address the challenges of massive scale, computational complexity, and inconsistent evaluation protocols in global road network data, this paper introduces OSM+, the first open-source billion-node global road graph dataset. Leveraging a 5,000-core cloud cluster, we implement distributed cleaning, fusion, and multimodal spatiotemporal alignment of OpenStreetMap data. Methodologically, we propose a scalable road graph construction framework that integrates graph neural networks with spatial database techniques, enabling efficient geospatial querying and foundation model training. Our contributions are threefold: (1) releasing a new traffic forecasting benchmark covering 31 cities and a large-scale traffic control dataset for six megacities; (2) enabling algorithm validation at the thousand-intersection scale, achieving breakthroughs in multi-agent coordination and system scalability; and (3) substantially expanding experimental scale and evaluation comprehensiveness for urban computing tasks, including traffic prediction, boundary detection, and policy simulation.
Abstract
Road network data provide rich information about cities and thus serve as the basis for a wide range of urban research. However, processing large volumes of worldwide road network data requires intensive computing resources, and the processed results can be difficult to unify for testing downstream tasks. Therefore, in this paper, we process the OpenStreetMap data with distributed computing on 5,000 cores on cloud services and release a structured worldwide 1-billion-vertex road network graph dataset with high accessibility (open source and downloadable by anyone in the world) and usability (open-box graph structure and an easy spatial query interface). To demonstrate how easily this dataset can be used, we present three illustrative use cases (traffic prediction, city boundary detection, and traffic policy control) and conduct extensive experiments on all three tasks. (1) For the well-investigated traffic prediction task, we release a new benchmark with 31 cities (traffic data processed and combined with our released OSM+ road network dataset), providing much larger spatial coverage and a more comprehensive evaluation of the compared algorithms than the frequently used earlier datasets. This new benchmark pushes the scalability of algorithms from hundreds of road network intersections to thousands. (2) For the more advanced traffic policy control task, which requires interaction with the road network, we release new datasets for 6 cities at a much larger scale than previous datasets. This poses new challenges for thousand-scale multi-agent coordination. (3) Along with the OSM+ dataset, we release data converters that facilitate the integration of multimodal spatial-temporal data for geospatial foundation model training, thereby expediting the process of uncovering compelling scientific insights.