BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents

📅 2024-07-08
📈 Citations: 16
Influential: 2
🤖 AI Summary
This work addresses the challenges of unified multimodal modeling (camera/LiDAR) and long-horizon scene forecasting in autonomous driving. Methodologically, it introduces the first scene-level BEV latent sequence diffusion framework tailored for world models: (1) a multimodal tokenizer enables cross-modal alignment; (2) BEV latent-space ray-casting rendering supports bidirectional encoding and decoding; and (3) an action-conditioned latent diffusion model generates spatiotemporally coherent future BEV sequences. The key contribution lies in elevating BEV representation from static perception to a generative, reasoning-capable dynamic world model. Evaluated on nuScenes and other benchmarks, the method significantly improves the realism of future scene generation and achieves state-of-the-art performance on downstream tasks—including 3D object detection and motion forecasting—demonstrating both fidelity and functional utility of the learned world model.
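The summary above describes a three-part pipeline: a multimodal tokenizer that fuses camera and LiDAR into a shared BEV latent, an action-conditioned diffusion model that forecasts future latents, and a decoder that renders latents back to sensor space. A toy, purely illustrative sketch of that data flow is below; every class and method name is an assumption (not the authors' API), and the "diffusion" step is a crude deterministic stand-in for a learned denoiser.

```python
# Illustrative sketch only: hypothetical names, toy math, no learned weights.
import random

class MultiModalTokenizer:
    """Encodes camera + LiDAR features into one shared BEV latent (a flat
    vector here) and decodes a latent back to per-modality reconstructions."""
    def __init__(self, latent_dim=8):
        self.latent_dim = latent_dim

    def encode(self, camera_feat, lidar_feat):
        # A real tokenizer would use learned encoders; here we just average
        # the two spatially aligned modality features into one BEV latent.
        assert len(camera_feat) == len(lidar_feat) == self.latent_dim
        return [(c + l) / 2.0 for c, l in zip(camera_feat, lidar_feat)]

    def decode(self, bev_latent):
        # Stand-in for ray-casting rendering back to image / point-cloud space.
        return {"image": list(bev_latent), "lidar": list(bev_latent)}

class LatentBEVDiffusion:
    """Action-conditioned forecaster: iteratively refines a noisy future
    latent toward the current latent shifted by the ego action (a crude
    analogue of denoising diffusion sampling)."""
    def __init__(self, steps=10):
        self.steps = steps

    def forecast(self, bev_latent, action, rng):
        target = [x + a for x, a in zip(bev_latent, action)]
        z = [rng.gauss(0.0, 1.0) for _ in bev_latent]  # start from noise
        for t in range(self.steps):                    # iterative refinement
            alpha = (t + 1) / self.steps
            z = [(1 - alpha) * zi + alpha * ti for zi, ti in zip(z, target)]
        return z

rng = random.Random(0)
tok = MultiModalTokenizer(latent_dim=4)
latent = tok.encode([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
future = LatentBEVDiffusion().forecast(latent, action=[0.5] * 4, rng=rng)
recon = tok.decode(future)
```

The key design point the paper emphasizes is that forecasting happens entirely in the compact BEV latent space, so a single predicted latent can be decoded to both modalities at once.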

📝 Abstract
World models have attracted increasing attention in autonomous driving for their ability to forecast potential future scenarios. In this paper, we propose BEVWorld, a novel framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model. The multi-modal tokenizer first encodes heterogeneous sensory data, and its decoder reconstructs the latent BEV tokens into LiDAR and surround-view image observations via ray-casting rendering in a self-supervised manner. This enables joint modeling and bidirectional encoding-decoding of panoramic imagery and point cloud data within a shared spatial representation. On top of this, the latent BEV sequence diffusion model performs temporally consistent forecasting of future scenes, conditioned on high-level action tokens, enabling scene-level reasoning over time. Extensive experiments demonstrate the effectiveness of BEVWorld on autonomous driving benchmarks, showcasing its capability in realistic future scene generation and its benefits for downstream tasks such as perception and motion prediction.
Problem

Research questions and friction points this paper is trying to address.

How to unify heterogeneous camera and LiDAR inputs in a single compact BEV latent space
How to jointly model panoramic imagery and point cloud data in a shared spatial representation
How to forecast future driving scenes with temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms multimodal inputs into a unified BEV latent space
Uses a multi-modal tokenizer for joint encoding of sensory data
Employs latent BEV diffusion for action-conditioned future scene forecasting
Yumeng Zhang
Baidu Inc., China
Shi Gong
Baidu Inc., China
Kaixin Xiong
Baidu Inc., China
Xiaoqing Ye
School of Computing and Artificial Intelligence, Southwest Jiaotong University
Granular Computing, Recommender System, Business Intelligence
Xiao Tan
Baidu Inc., China
Fan Wang
Baidu Inc., China
Jizhou Huang
Baidu
Generative AI, Data Mining, Natural Language Processing
Hua Wu
Baidu Inc., China
Haifeng Wang
Baidu Inc., China