AI Summary
To address the consistency gap arising from decoupled scene prediction and motion planning in autonomous driving world models, this paper proposes DriveLaW, the first end-to-end unified paradigm that jointly models video generation and trajectory planning within a shared latent space. Specifically, a diffusion-driven latent-space video generator (DriveLaW-Video) directly conditions a diffusion-based planner (DriveLaW-Act), ensuring intrinsic alignment between predicted future scenes and corresponding action decisions. We introduce a novel three-stage progressive training strategy to co-optimize generative fidelity and planning reliability. On standard benchmarks, DriveLaW reduces video prediction FID by 33.3% and FVD by 1.8%. In NAVSIM planning tasks, it achieves state-of-the-art performance, with significant improvements in safety-critical behavior and generalization to long-tail scenarios.
Abstract
World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasts with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent representation of DriveLaW-Video. Both components are optimized with a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results on both tasks. DriveLaW not only advances video prediction significantly, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
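To make the idea of a planner conditioned on a video generator's latent more concrete, the following is a deliberately tiny Python sketch. It is not the authors' architecture or code: `toy_video_latent`, `toy_denoiser`, and `plan` are hypothetical names, the "latent" is a two-number summary of past speeds, and the "denoiser" is a hand-written rule rather than a trained diffusion network. It only illustrates the control flow the abstract describes, in which a trajectory starts as noise and is refined step by step, with every step conditioned on the same scene latent.

```python
import random


def toy_video_latent(past_speeds):
    """Hypothetical stand-in for a video generator's latent representation.

    Summarizes the observed scene as [mean speed, speed change over the window].
    """
    mean_speed = sum(past_speeds) / len(past_speeds)
    trend = past_speeds[-1] - past_speeds[0]
    return [mean_speed, trend]


def toy_denoiser(noisy_traj, latent, step_fraction):
    """Hypothetical denoiser: predicts a clean trajectory from the latent,
    then moves the noisy sample a fraction of the way toward it."""
    mean_speed, trend = latent
    horizon = len(noisy_traj)
    predicted = [
        mean_speed * (k + 1) + 0.5 * trend * (k + 1) / horizon
        for k in range(horizon)
    ]
    return [x + step_fraction * (p - x) for x, p in zip(noisy_traj, predicted)]


def plan(latent, horizon=4, num_steps=10, seed=0):
    """Iteratively refine a noise sample into waypoints, conditioned on the latent."""
    rng = random.Random(seed)
    traj = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # start from pure noise
    for i in range(num_steps):
        # The step fraction grows to 1.0 on the final step, so the sample
        # ends at the latent-conditioned prediction.
        traj = toy_denoiser(traj, latent, step_fraction=1.0 / (num_steps - i))
    return traj


if __name__ == "__main__":
    latent = toy_video_latent([2.0, 2.0, 2.0])
    print(plan(latent))
```

Because the planner reads only the latent, two scenes with different latents yield different trajectories even from the same starting noise. That is the consistency property the unified design aims for, shown here only in toy form.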