MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations

📅 2023-11-20
🏛️ arXiv.org
📈 Citations: 13
Influential: 0
🤖 AI Summary
Most existing world models for autonomous driving rely on unimodal camera inputs, with little systematic exploration of jointly modeling LiDAR and 3D occupancy. This paper introduces a generative multimodal world model that integrates camera and LiDAR, jointly modeling raw observations (images and point clouds) alongside geometrically constrained voxelized 3D occupancy grids. The approach employs a cross-modal Transformer for temporal modeling and a unified autoregressive prediction framework to enable end-to-end co-learning. Crucially, the experiments reveal the limitations of feature-level fusion in preserving geometric fidelity and show that an explicit 3D occupancy representation significantly improves scene reasoning and action generalization. On nuScenes, the method achieves a 12.7% reduction in multi-step sensor-reconstruction LPIPS and an 8.3% improvement in 3D occupancy mIoU, establishing new state-of-the-art performance.
📝 Abstract
World models for autonomous driving have the potential to dramatically improve the reasoning capabilities of today's systems. However, most works focus on camera data, with only a few that leverage lidar data or combine both to better represent autonomous vehicle sensor setups. In addition, raw sensor predictions are less actionable than 3D occupancy predictions, but there are no works examining the effects of combining both multimodal sensor data and 3D occupancy prediction. In this work, we perform a set of experiments with a MUltimodal World Model with Geometric VOxel representations (MUVO) to evaluate different sensor fusion strategies to better understand the effects on sensor data prediction. We also analyze potential weaknesses of current sensor fusion approaches and examine the benefits of additionally predicting 3D occupancy.
Problem

Research questions and friction points this paper is trying to address.

Evaluating sensor fusion strategies for autonomous driving world models
Combining multimodal sensor data with 3D occupancy prediction
Analyzing weaknesses in current sensor fusion approaches
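The geometric voxel representation at the heart of these questions can be illustrated with a minimal sketch: discretizing a lidar point cloud into a binary 3D occupancy grid. The point cloud, grid extent, and 1 m resolution below are hypothetical choices for illustration, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical point cloud in ego coordinates (meters); values are random
# stand-ins for real lidar returns.
rng = np.random.default_rng(0)
points = rng.uniform(-20.0, 20.0, size=(1000, 3))

# Voxelize into an occupancy grid: 1 m resolution over a 40 m cube.
resolution = 1.0
origin = np.array([-20.0, -20.0, -20.0])
grid_shape = (40, 40, 40)

# Map each point to its voxel index and keep only indices inside the grid.
idx = np.floor((points - origin) / resolution).astype(int)
valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)

# Mark every voxel that contains at least one point as occupied.
occupancy = np.zeros(grid_shape, dtype=bool)
occupancy[tuple(idx[valid].T)] = True

print(occupancy.shape)  # (40, 40, 40)
```

Predicting such a grid instead of (or alongside) raw sensor streams is what makes the output directly actionable for downstream planning.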
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion of camera and lidar data
Geometric voxel representations for 3D occupancy
Analyzing sensor fusion strategies for autonomy
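One fusion strategy the paper's experiments would compare can be sketched as follows: project pooled camera and lidar features into a shared latent space, fuse them, and decode a coarse occupancy grid. All shapes, names, and the additive fusion rule here are illustrative assumptions, not the paper's actual architecture (which uses a cross-modal Transformer).

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, w):
    """Linear projection into a target feature space."""
    return x @ w

# Hypothetical pooled backbone outputs for one frame.
cam_feat = rng.standard_normal(512)    # camera branch
lidar_feat = rng.standard_normal(256)  # lidar branch

# Projections (random stand-ins for learned weights) map both modalities
# into a shared 128-d latent.
w_cam = rng.standard_normal((512, 128)) * 0.02
w_lidar = rng.standard_normal((256, 128)) * 0.02

# Feature-level fusion by summation; alternatives such as concatenation or
# cross-attention are what a fusion-strategy study would compare.
latent = project(cam_feat, w_cam) + project(lidar_feat, w_lidar)

# Decode a coarse 8x8x8 voxel occupancy grid from the fused latent.
w_occ = rng.standard_normal((128, 8 * 8 * 8)) * 0.02
occ_logits = project(latent, w_occ).reshape(8, 8, 8)
occ_prob = 1.0 / (1.0 + np.exp(-occ_logits))  # per-voxel probability

print(occ_prob.shape)  # (8, 8, 8)
```

The fused latent is where feature-level fusion can lose geometric fidelity, which is why the paper argues for additionally supervising an explicit 3D occupancy output.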
Daniel Bogdoll
FZI Research Center for Information Technology, 76131 Karlsruhe, Germany; KIT Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
Yitian Yang
AI4SG Lab, National University of Singapore
Human-Computer Interaction; Human-centered AI
J. Zöllner
FZI Research Center for Information Technology, 76131 Karlsruhe, Germany; KIT Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany