MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations

📅 2023-11-20
🏛️ arXiv.org
📈 Citations: 13
Influential: 0
🤖 AI Summary
Most existing world models for autonomous driving rely on unimodal camera inputs, with little systematic exploration of jointly modeling LiDAR and 3D occupancy. This paper introduces a generative multimodal world model that integrates camera and LiDAR, jointly modeling raw observations (images and point clouds) alongside geometrically constrained voxelized 3D occupancy grids. The approach employs a cross-modal Transformer for temporal modeling and a unified autoregressive prediction framework to enable end-to-end co-learning. Crucially, the experiments reveal the limitations of feature-level fusion in preserving geometric fidelity and show that an explicit 3D occupancy representation significantly improves scene reasoning and action generalization. On nuScenes, the method achieves a 12.7% reduction in multi-step sensor-reconstruction LPIPS and an 8.3% improvement in 3D occupancy mIoU, establishing new state-of-the-art performance.
📝 Abstract
World models for autonomous driving have the potential to dramatically improve the reasoning capabilities of today's systems. However, most works focus on camera data, with only a few that leverage lidar data or combine both to better represent autonomous vehicle sensor setups. In addition, raw sensor predictions are less actionable than 3D occupancy predictions, but there are no works examining the effects of combining both multimodal sensor data and 3D occupancy prediction. In this work, we perform a set of experiments with a MUltimodal World Model with Geometric VOxel representations (MUVO) to evaluate different sensor fusion strategies to better understand the effects on sensor data prediction. We also analyze potential weaknesses of current sensor fusion approaches and examine the benefits of additionally predicting 3D occupancy.
Problem

Research questions and friction points this paper is trying to address.

Evaluating sensor fusion strategies for autonomous driving world models
Combining multimodal sensor data with 3D occupancy prediction
Analyzing weaknesses in current sensor fusion approaches
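The geometric voxel representation at the heart of these questions can be illustrated with a minimal sketch: discretizing a lidar point cloud into a binary 3D occupancy grid. The point cloud, grid extent, and 1 m resolution below are hypothetical choices for illustration, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical point cloud in ego coordinates (meters); values are random
# stand-ins for real lidar returns.
rng = np.random.default_rng(0)
points = rng.uniform(-20.0, 20.0, size=(1000, 3))

# Voxelize into an occupancy grid: 1 m resolution over a 40 m cube.
resolution = 1.0
origin = np.array([-20.0, -20.0, -20.0])
grid_shape = (40, 40, 40)

# Map each point to its voxel index and keep only indices inside the grid.
idx = np.floor((points - origin) / resolution).astype(int)
valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)

# Mark every voxel that contains at least one point as occupied.
occupancy = np.zeros(grid_shape, dtype=bool)
occupancy[tuple(idx[valid].T)] = True

print(occupancy.shape)  # (40, 40, 40)
```

Predicting such a grid instead of (or alongside) raw sensor streams is what makes the output directly actionable for downstream planning.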
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion of camera and lidar data
Geometric voxel representations for 3D occupancy
Analyzing sensor fusion strategies for autonomy
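One fusion strategy the paper's experiments would compare can be sketched as follows: project pooled camera and lidar features into a shared latent space, fuse them, and decode a coarse occupancy grid. All shapes, names, and the additive fusion rule here are illustrative assumptions, not the paper's actual architecture (which uses a cross-modal Transformer).

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, w):
    """Linear projection into a target feature space."""
    return x @ w

# Hypothetical pooled backbone outputs for one frame.
cam_feat = rng.standard_normal(512)    # camera branch
lidar_feat = rng.standard_normal(256)  # lidar branch

# Projections (random stand-ins for learned weights) map both modalities
# into a shared 128-d latent.
w_cam = rng.standard_normal((512, 128)) * 0.02
w_lidar = rng.standard_normal((256, 128)) * 0.02

# Feature-level fusion by summation; alternatives such as concatenation or
# cross-attention are what a fusion-strategy study would compare.
latent = project(cam_feat, w_cam) + project(lidar_feat, w_lidar)

# Decode a coarse 8x8x8 voxel occupancy grid from the fused latent.
w_occ = rng.standard_normal((128, 8 * 8 * 8)) * 0.02
occ_logits = project(latent, w_occ).reshape(8, 8, 8)
occ_prob = 1.0 / (1.0 + np.exp(-occ_logits))  # per-voxel probability

print(occ_prob.shape)  # (8, 8, 8)
```

The fused latent is where feature-level fusion can lose geometric fidelity, which is why the paper argues for additionally supervising an explicit 3D occupancy output.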
Daniel Bogdoll
FZI Research Center for Information Technology, 76131 Karlsruhe, Germany; KIT Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
Yitian Yang
AI4SG Lab, National University of Singapore
Human-Computer Interaction; Human-centered AI
J. Zöllner
FZI Research Center for Information Technology, 76131 Karlsruhe, Germany; KIT Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany