🤖 AI Summary
Current generative models for autonomous driving simulation struggle with multi-agent interactions, fine-grained action control, and cross-camera geometric consistency. To address these challenges, this paper proposes GAIA-2, a controllable multi-view generative world model tailored for autonomous driving. Built on a latent diffusion architecture, it integrates structured conditional encoding (ego-vehicle dynamics, agent configurations, and environmental semantics), cross-view spatiotemporal attention, and semantic latent embeddings injected from an external driving model. The approach unifies multi-agent collaborative modeling, fine-grained action-controllable video generation, and geometrically consistent multi-camera video synthesis, enabling extrapolation to rare scenarios. Evaluated on real-world driving distributions from the UK, US, and Germany, it generates high-resolution, spatiotemporally coherent multi-view videos, improving simulation diversity and physical fidelity. The model is used in the development and validation of production-grade autonomous driving systems.
📝 Abstract
Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving, such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.
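To make the idea of "structured conditioning" concrete, the sketch below shows one hypothetical way such inputs could be gathered and flattened into a vector for a conditional generative model. All field names and the encoding scheme are illustrative assumptions, not GAIA-2's actual interface.

```python
from dataclasses import dataclass

# Hypothetical structured conditioning inputs, mirroring the categories named
# in the abstract (ego dynamics, agents, environment, road semantics).
# These names are assumptions for illustration, not GAIA-2's real API.
@dataclass
class SceneConditioning:
    ego_speed_mps: float        # ego-vehicle dynamics
    ego_curvature: float
    agents: list                # per-agent (x, y, heading, category) tuples
    weather: str                # environmental factors
    time_of_day: str
    lane_count: int             # road semantics
    driving_side: str           # e.g. "left" (UK) or "right" (US, Germany)

def to_condition_vector(c: SceneConditioning) -> list:
    """Flatten the structured fields into a numeric vector that a
    conditional diffusion model could consume; categorical fields are
    encoded crudely here for brevity."""
    vec = [c.ego_speed_mps, c.ego_curvature, float(c.lane_count)]
    vec.append(0.0 if c.driving_side == "left" else 1.0)
    for x, y, heading, _category in c.agents:
        vec.extend([x, y, heading])
    return vec

cond = SceneConditioning(
    ego_speed_mps=12.5, ego_curvature=0.01,
    agents=[(5.0, -1.2, 0.0, "car")],
    weather="rain", time_of_day="night",
    lane_count=2, driving_side="left",
)
print(len(to_condition_vector(cond)))  # 3 ego/road values + 1 side flag + 3 per agent = 7
```

In a real system the categorical fields (weather, time of day) would be embedded rather than dropped or binarized, and per-agent state would be encoded with a set-invariant module; the point here is only the shape of the structured-to-vector conditioning pipeline.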