DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

📅 2025-12-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the consistency gap arising from decoupled scene prediction and motion planning in autonomous driving world models, this paper proposes DriveLaW, the first end-to-end unified paradigm that jointly models video generation and trajectory planning within a shared latent space. Specifically, a diffusion-driven latent-space video generator (DriveLaW-Video) directly conditions a diffusion-based planner (DriveLaW-Act), ensuring intrinsic alignment between predicted future scenes and the corresponding action decisions. A three-stage progressive training strategy co-optimizes generative fidelity and planning reliability. On standard benchmarks, DriveLaW reduces video prediction FID by 33.3% and FVD by 1.8%, and on the NAVSIM planning benchmark it achieves state-of-the-art performance, with notable gains in safety-critical behavior and generalization to long-tail scenarios.
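To make the latent-conditioning idea concrete, below is a minimal PyTorch sketch of a video world model whose latent tokens condition a diffusion planner. All class names, dimensions, the cross-attention conditioning mechanism, and the toy noise schedule are hypothetical stand-ins chosen for illustration; the paper's actual DriveLaW-Video and DriveLaW-Act architectures may differ.

```python
# Minimal sketch of the latent-conditioning idea: a video world model
# produces latent tokens, and a diffusion planner denoises a trajectory
# while attending to those latents. Hypothetical stand-ins, not the
# paper's implementation.
import torch
import torch.nn as nn

class VideoLatentEncoder(nn.Module):
    """Stand-in for DriveLaW-Video: maps frames to latent tokens."""
    def __init__(self, in_ch=3, dim=256, n_tokens=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, dim, 4, stride=4),
        )
        self.n_tokens = n_tokens

    def forward(self, frames):                    # frames: (B, C, H, W)
        feat = self.backbone(frames)              # (B, dim, h, w)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, h*w, dim)
        return tokens[:, :self.n_tokens]          # fixed token count

class DiffusionPlanner(nn.Module):
    """Stand-in for DriveLaW-Act: denoises a trajectory given latents."""
    def __init__(self, dim=256, state_dim=2, n_steps=1000):
        super().__init__()
        self.traj_proj = nn.Linear(state_dim, dim)
        self.time_emb = nn.Embedding(n_steps, dim)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.head = nn.Linear(dim, state_dim)

    def forward(self, noisy_traj, t, latents):
        # noisy_traj: (B, horizon, state_dim); latents: (B, N, dim)
        h = self.traj_proj(noisy_traj) + self.time_emb(t)[:, None]
        h, _ = self.cross_attn(h, latents, latents)  # inject world latents
        return self.head(h)                          # predicted noise

# One DDPM-style epsilon-prediction training step with a toy schedule.
enc, planner = VideoLatentEncoder(), DiffusionPlanner()
frames = torch.randn(4, 3, 256, 256)
traj = torch.randn(4, 8, 2)                 # ground-truth waypoints
t = torch.randint(0, 1000, (4,))
noise = torch.randn_like(traj)
alpha = torch.cos(t.float() / 1000 * torch.pi / 2)[:, None, None]
noisy = alpha * traj + (1 - alpha**2).clamp(min=0).sqrt() * noise
loss = nn.functional.mse_loss(planner(noisy, t, enc(frames)), noise)
loss.backward()
```

Cross-attention is one common way to feed conditioning tokens into a diffusion denoiser; whatever mechanism the paper actually uses, the key point from the summary is that the planner consumes the generator's latents directly rather than re-encoding rendered frames.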

πŸ“ Abstract
World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasts with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latents of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, but also sets a new record on the NAVSIM planning benchmark.
Problem

Research questions and friction points this paper is trying to address.

How to unify video generation and motion planning for autonomous driving in a single model
How to keep predicted future scenarios consistent with planned trajectories
How to reach state-of-the-art performance in both video forecasting and planning at once
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies video generation and motion planning in one model
Injects latent video representations directly into trajectory planner
Uses three-stage progressive training for both components (one plausible schedule is sketched after this list)
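The listing names a three-stage progressive strategy but does not detail the stages. The sketch below shows one plausible schedule (generative pretraining of the video model, then planner training on frozen latents, then joint fine-tuning); the stage boundaries, placeholder modules, and losses are assumptions for illustration, not the paper's recipe.

```python
# Illustrative three-stage progressive schedule; the stage contents are
# an assumption, not the paper's actual training recipe.
import torch
import torch.nn as nn

video_model = nn.Linear(16, 16)   # placeholder for DriveLaW-Video
planner = nn.Linear(16, 2)        # placeholder for DriveLaW-Act

def freeze(module, frozen=True):
    for p in module.parameters():
        p.requires_grad_(not frozen)

def train_stage(name, modules, steps, lr):
    params = [p for m in modules for p in m.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        x = torch.randn(8, 16)                  # stand-in batch
        latents = video_model(x)                # "world model" output
        loss = latents.pow(2).mean() + planner(latents).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"{name}: finished {steps} steps")

# Stage 1: generative pretraining of the video model alone.
freeze(planner); freeze(video_model, frozen=False)
train_stage("stage1_video", [video_model], steps=10, lr=1e-4)

# Stage 2: the planner learns from frozen world-model latents.
freeze(video_model); freeze(planner, frozen=False)
train_stage("stage2_planner", [planner], steps=10, lr=1e-4)

# Stage 3: joint fine-tuning of both components at a lower rate.
freeze(video_model, frozen=False); freeze(planner, frozen=False)
train_stage("stage3_joint", [video_model, planner], steps=10, lr=1e-5)
```

Staging like this is a standard way to co-optimize a generator and a downstream head without letting early planner gradients destabilize the world model; whether DriveLaW partitions its stages this way is not stated on this page.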
Authors

Tianze Xia, Huazhong University of Science and Technology
Yongkang Li, Huazhong University of Science and Technology
Lijun Zhou, Xiaomi Corporation
Jingfeng Yao, Huazhong University of Science and Technology (computer vision, generative models)
Kaixin Xiong, Xiaomi EV
Haiyang Sun, Xiaomi EV
Bing Wang, Xiaomi EV
Kun Ma, University of Jinan (model-driven engineering, big data management, data-intensive computing)
Hangjun Ye, Xiaomi EV
Wenyu Liu, Huazhong University of Science and Technology
Xinggang Wang, Professor, Huazhong University of Science and Technology (artificial intelligence, computer vision, autonomous driving, object detection, object segmentation)