UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos

📅 2025-10-16
🤖 AI Summary
Existing urban simulation environments struggle to balance scalability with real-world complexity, which hinders training of embodied AI agents such as delivery robots and quadrupeds. This paper introduces UrbanVerse: the first video-driven, end-to-end real-to-simulation system that automatically generates interactive, metric-scale, high-fidelity urban scenes in IsaacSim. It parses crowd-sourced city-tour videos and integrates 3D reconstruction, semantic segmentation, asset retrieval, and physics-based simulation. Key contributions include the UrbanVerse-100K asset library (100k+ annotated urban 3D assets) and the UrbanVerse-Gen generation pipeline, which together enable semantically consistent and physically plausible scene instantiation. The system provides 160 high-quality scenes from 24 countries and a benchmark of 10 artist-designed test scenes. Navigation policies trained in UrbanVerse follow scaling power laws, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, and a 300 m real-world mission is completed with only two human interventions.

📝 Abstract
Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, and accomplishing a 300 m real-world mission with only two interventions.
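
The abstract states that policies trained in UrbanVerse exhibit scaling power laws, but this summary does not give the fitted form or the measured quantities. As a rough illustration only, a power law y ≈ a · x^b can be checked by fitting a straight line in log-log space. The sketch below assumes, hypothetically, that x is the number of training scenes and y is a failure rate; the function name `fit_power_law` and all numbers are made up for illustration and are not taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y ≈ a * x**b by ordinary least squares on (log x, log y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic placeholder data generated from y = 0.8 * x**-0.25.
# These are NOT results from the paper; they only exercise the fit.
num_scenes = [10, 20, 40, 80, 160]
failure_rate = [0.8 * n ** -0.25 for n in num_scenes]

a, b = fit_power_law(num_scenes, failure_rate)
print(f"a={a:.3f}, b={b:.3f}")
```

A negative exponent b in such a fit would mean the error shrinks as more scenes are added; how closely real measurements follow the line is what a scaling claim rests on.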
Problem

Research questions and friction points this paper is trying to address.

Scaling urban simulation using crowd-sourced city-tour videos
Creating physics-aware interactive scenes from real-world data
Improving urban navigation policies through realistic simulation training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts city-tour videos into interactive simulation scenes
Automatically extracts scene layouts and instantiates 3D simulations
Creates physics-aware urban environments from crowd-sourced videos