🤖 AI Summary
Current generative models for autonomous driving simulation struggle with multi-agent interactions, fine-grained action control, and cross-camera geometric consistency. To address these challenges, this paper proposes GAIA-2, a controllable multi-view generative world model tailored for autonomous driving. Built on a latent diffusion architecture, it integrates structured conditional encoding (ego-vehicle dynamics, agent configurations, and environmental semantics), cross-view spatiotemporal attention, and semantic latent embeddings injected from an external driving model. The approach unifies multi-agent collaborative modeling, fine-grained action-controllable video generation, and geometrically consistent multi-camera video synthesis, enabling extrapolation to rare scenarios. Evaluated on real-world driving distributions from the UK, US, and Germany, it generates high-resolution, spatiotemporally coherent multi-view videos, improving simulation diversity and physical fidelity. The model is used in the development and validation of production-grade autonomous driving systems.
📝 Abstract
Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving, such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.
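To make the idea of "structured conditioning" concrete, the sketch below shows one hypothetical way such inputs could be gathered and flattened into a vector for a conditional generative model. All field names and the encoding scheme are illustrative assumptions, not GAIA-2's actual interface.

```python
from dataclasses import dataclass

# Hypothetical structured conditioning inputs, mirroring the categories named
# in the abstract (ego dynamics, agents, environment, road semantics).
# These names are assumptions for illustration, not GAIA-2's real API.
@dataclass
class SceneConditioning:
    ego_speed_mps: float        # ego-vehicle dynamics
    ego_curvature: float
    agents: list                # per-agent (x, y, heading, category) tuples
    weather: str                # environmental factors
    time_of_day: str
    lane_count: int             # road semantics
    driving_side: str           # e.g. "left" (UK) or "right" (US, Germany)

def to_condition_vector(c: SceneConditioning) -> list:
    """Flatten the structured fields into a numeric vector that a
    conditional diffusion model could consume; categorical fields are
    encoded crudely here for brevity."""
    vec = [c.ego_speed_mps, c.ego_curvature, float(c.lane_count)]
    vec.append(0.0 if c.driving_side == "left" else 1.0)
    for x, y, heading, _category in c.agents:
        vec.extend([x, y, heading])
    return vec

cond = SceneConditioning(
    ego_speed_mps=12.5, ego_curvature=0.01,
    agents=[(5.0, -1.2, 0.0, "car")],
    weather="rain", time_of_day="night",
    lane_count=2, driving_side="left",
)
print(len(to_condition_vector(cond)))  # 3 ego/road values + 1 side flag + 3 per agent = 7
```

In a real system the categorical fields (weather, time of day) would be embedded rather than dropped or binarized, and per-agent state would be encoded with a set-invariant module; the point here is only the shape of the structured-to-vector conditioning pipeline.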