🤖 AI Summary
Existing urban simulation environments either lack scalability or fail to capture real-world complexity, which hinders the training of embodied AI agents such as delivery robots and quadrupeds. This paper introduces UrbanVerse, a video-driven, end-to-end real-to-sim system that automatically generates interactive, metric-scale, high-fidelity urban scenes in IsaacSim. It parses crowd-sourced city-tour videos and integrates 3D reconstruction, semantic segmentation, asset retrieval, and physics-based simulation. Its two components are UrbanVerse-100K, a repository of 100k+ urban 3D assets annotated with semantic and physical attributes, and UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates them with retrieved assets. UrbanVerse provides 160 high-quality constructed scenes from 24 countries and a curated benchmark of 10 artist-designed test scenes. Navigation policies trained in UrbanVerse exhibit power-law scaling and improve success over prior methods by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer; a 300 m real-world mission was completed with only two interventions.
📝 Abstract
Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents at scale requires diverse, high-fidelity urban environments, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods. A policy trained in UrbanVerse also accomplished a 300 m real-world mission with only two interventions.
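As a rough illustration of what a "scaling power law" means here, the sketch below fits a curve of the form metric ≈ a · N^b to a performance metric measured at several training-set sizes N, using a straight-line fit in log-log space. The abstract does not say which metric or fitting procedure the authors use, so the function name `fit_power_law`, the choice of failure rate as the metric, and all numbers are assumptions; the data are synthetic placeholders, not results from the paper.

```python
import numpy as np


def fit_power_law(n_scenes, metric):
    """Fit metric ~ a * n_scenes**b by least squares in log-log space.

    Hypothetical helper for illustration; not part of UrbanVerse.
    Both inputs must be strictly positive.
    """
    log_n = np.log(np.asarray(n_scenes, dtype=float))
    log_m = np.log(np.asarray(metric, dtype=float))
    # A power law is a straight line in log-log space:
    # log(metric) = b * log(n) + log(a)
    b, log_a = np.polyfit(log_n, log_m, deg=1)
    return float(np.exp(log_a)), float(b)


# Synthetic placeholder data generated from a known power law.
# These are NOT measurements from the paper.
n_scenes = np.array([10, 20, 40, 80, 160])
failure_rate = 0.9 * n_scenes ** -0.3

a, b = fit_power_law(n_scenes, failure_rate)
print(f"a={a:.3f}, b={b:.3f}")
```

A negative exponent b in such a fit would indicate that the error metric keeps shrinking as more training scenes are added, with diminishing returns per added scene.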