UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes UniDriveDreamer, a single-stage unified multimodal world model capable of directly generating future observations from multiple camera views and LiDAR sequences without relying on intermediate representations or cascaded architectures. Existing approaches are often limited to unimodal generation and struggle to jointly synthesize synchronized video and LiDAR data. The key innovation lies in the Unified Latent Anchoring (ULA) mechanism, which effectively aligns the latent distributions of LiDAR and video modalities, further guided by a structured scene layout as a conditional signal. Integrating modality-specific VAEs for LiDAR and video with a diffusion-based Transformer, UniDriveDreamer outperforms current methods in multimodal generation and significantly enhances downstream perception and planning performance.

📝 Abstract
World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream perception and planning tasks.
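The abstract describes a pipeline in which two modality-specific VAEs encode LiDAR and video, Unified Latent Anchoring (ULA) aligns the two latent distributions, and the aligned latents are fused for a diffusion transformer. The paper's exact ULA formulation is not given here, so the sketch below is purely illustrative: it assumes a simple anchoring that standardizes each modality's latent batch to a shared zero-mean, unit-variance space before fusion. All function names and dimensions are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, out_dim, seed):
    """Stand-in for a modality-specific VAE encoder:
    a fixed random linear projection to a latent of size out_dim."""
    w = np.random.default_rng(seed).normal(size=(x.shape[-1], out_dim))
    return x @ w

def unified_latent_anchor(z):
    """Hypothetical ULA step: align a latent batch to a shared
    zero-mean, unit-variance 'anchor' distribution, per channel."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-6)

# Fake batches: multi-camera video features and LiDAR features with
# deliberately different raw dimensionalities and scales.
video_feat = rng.normal(loc=5.0, scale=3.0, size=(8, 128))
lidar_feat = rng.normal(loc=-2.0, scale=0.5, size=(8, 64))

z_video = unified_latent_anchor(encode(video_feat, 32, seed=1))
z_lidar = unified_latent_anchor(encode(lidar_feat, 32, seed=2))

# After anchoring, both latents share comparable statistics, so they
# can be fused (here: concatenated along the channel axis) and handed
# to a joint diffusion transformer (not modeled in this sketch).
fused = np.concatenate([z_video, z_lidar], axis=-1)
print(fused.shape)
```

The point of the anchoring step in this sketch is that the raw video and LiDAR features start with very different means and scales; without some alignment, one modality would dominate the fused representation that the joint transformer consumes.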
Problem

Research questions and friction points this paper is trying to address.

world model
autonomous driving
multimodal generation
LiDAR
video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal world model
single-stage generation
Unified Latent Anchoring
diffusion transformer
LiDAR-video synthesis
Guosheng Zhao
Institute of Automation, Chinese Academy of Sciences
Yaozeng Wang
GigaAI
Xiaofeng Wang
GigaAI
Zheng Zhu
GigaAI
Tingdong Yu
GigaAI
Guan Huang
GigaAI
Yongchen Zai
BYD
Ji Jiao
BYD
Changliang Xue
BYD
Xiaole Wang
BYD
Zhen Yang
BYD
Futang Zhu
BYD
Xingang Wang
CASIA