Grounding World Simulation Models in a Real-World Metropolis

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing generative world models struggle to faithfully reconstruct real-world cities and their dynamic scenes with high fidelity. This work proposes SWM, the first world model capable of month-scale simulation of real urban environments, leveraging a retrieval-augmented autoregressive video generation framework that integrates street-view imagery to achieve photorealistic reconstructions of cities such as Seoul. Key innovations include cross-temporal pairing, a Virtual Lookahead Sink, and a view interpolation pipeline, which collectively address temporal misalignment between reference images and dynamic content, insufficient trajectory diversity, and data sparsity. Experiments in Seoul, Busan, and Ann Arbor demonstrate that SWM generates spatially accurate and temporally coherent videos over trajectories spanning hundreds of meters, supports diverse camera motions, and enables text-guided scene editing, significantly outperforming current state-of-the-art methods.

📝 Abstract
What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges: temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
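The generation loop the abstract describes can be sketched as follows: video is produced chunk by chunk, and each chunk is re-grounded to a street-view image retrieved at a position slightly ahead of the camera (the Virtual Lookahead Sink idea). This is a minimal illustrative sketch, not the paper's implementation: the 1-D positions, the `StreetView` type, the retrieval-by-distance step, and the toy "generation" that merely tags frames with their anchor are all assumptions made for clarity.

```python
# Toy sketch of retrieval-augmented autoregressive chunked generation with a
# lookahead anchor. All names and the 1-D geometry are illustrative only.
from dataclasses import dataclass


@dataclass
class StreetView:
    position: float  # 1-D stand-in for a map coordinate
    image: str       # placeholder for pixel data


def retrieve_nearest(views, position):
    """Retrieve the street-view image closest to a query position."""
    return min(views, key=lambda v: abs(v.position - position))


def generate_trajectory(views, start, step, chunk_len, n_chunks, lookahead):
    """Generate frames chunk by chunk; each chunk is re-grounded to an
    image retrieved at a *future* position (the lookahead anchor), which
    is what stabilizes long-horizon rollouts in the abstract's framing."""
    frames, pos = [], start
    for _ in range(n_chunks):
        anchor = retrieve_nearest(views, pos + lookahead)  # future location
        for _ in range(chunk_len):
            # Toy "generation": record position and the anchor it was
            # conditioned on, instead of synthesizing real pixels.
            frames.append((round(pos, 1), anchor.image))
            pos += step
    return frames


# Sparse street-view captures every 10 units along a 100-unit trajectory.
views = [StreetView(float(p), f"sv@{p}") for p in range(0, 101, 10)]
frames = generate_trajectory(views, start=0.0, step=1.0,
                             chunk_len=5, n_chunks=4, lookahead=12.0)
```

The key design point mirrored here is that the anchor is retrieved ahead of the current position rather than at it, so each new chunk is pulled toward upcoming real-world content instead of drifting on its own generations.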
Problem

Research questions and friction points this paper is trying to address.

world simulation model
real-world grounding
autoregressive video generation
temporal consistency
urban environment
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model
retrieval-augmented generation
view interpolation
cross-temporal pairing
Virtual Lookahead Sink