OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autonomous driving data collection is costly, and long-tail scenarios are scarce; existing generative methods are largely unimodal, which leads to misalignment across sensor modalities. To address this, the paper proposes OmniGen, a unified multimodal generation framework grounded in a shared bird's-eye view (BEV) space. The approach combines a Universal Autoencoder (UAE) with a controllable Diffusion Transformer carrying a ControlNet branch, enabling joint synthesis of LiDAR point clouds and multi-view camera images through BEV feature encoding, volume rendering, and conditional control. Because both modalities are decoded from the same BEV representation, the framework maintains geometric and semantic consistency across sensors while supporting flexible sensor configurations. Experiments show improvements over unimodal baselines in generation fidelity, cross-modal alignment, and controllability.
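Below is a minimal, hypothetical sketch of this shared-BEV idea in PyTorch: a single BEV latent is decoded by two lightweight heads, one per modality, so both outputs inherit the same geometry. Class and attribute names (OmniGenSketch, lidar_head, camera_head) and all shapes are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class OmniGenSketch(nn.Module):
    """Toy stand-in for a unified generator grounded in a shared BEV latent."""

    def __init__(self, bev_channels: int = 64, bev_size=(128, 128)):
        super().__init__()
        h, w = bev_size
        # Shared BEV latent; in the paper this would come from a DiT denoising
        # process conditioned through a ControlNet branch.
        self.bev_latent = nn.Parameter(torch.randn(1, bev_channels, h, w))
        # Joint decoder standing in for the UAE: one lightweight head per modality.
        self.lidar_head = nn.Conv2d(bev_channels, 1, kernel_size=1)   # occupancy / range in BEV
        self.camera_head = nn.Conv2d(bev_channels, 3, kernel_size=1)  # per-view image features

    def forward(self, condition: Optional[torch.Tensor] = None):
        bev = self.bev_latent
        if condition is not None:
            # Conditional control: inject layout / scene conditions into the BEV latent.
            bev = bev + condition
        # Both modalities are decoded from the same BEV features, which is what
        # keeps them geometrically and semantically aligned.
        return self.lidar_head(bev), self.camera_head(bev)


model = OmniGenSketch()
lidar_bev, camera_feat = model()
print(lidar_bev.shape, camera_feat.shape)  # torch.Size([1, 1, 128, 128]) torch.Size([1, 3, 128, 128])
```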

📝 Abstract
Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data, but existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OmniGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments.
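To make the volume-rendering decoding concrete, the following sketch shows standard alpha compositing along rays: the same per-sample densities produce an expected depth per ray (a LiDAR-like signal) and a rendered color (a camera-like signal). The function name, tensor shapes, and random inputs are assumptions for illustration; in the paper, the sampled densities and features would come from the shared BEV volume rather than random tensors.

```python
import torch


def render_rays(density, color, t_vals):
    """Alpha-composite samples along each ray.

    density: (R, S)    non-negative density at S samples on R rays
    color:   (R, S, 3) per-sample color (camera branch)
    t_vals:  (S,)      sample depths along the ray
    Returns the expected depth per ray (LiDAR-like) and rendered color (camera-like).
    """
    deltas = torch.diff(t_vals, append=t_vals[-1:] + 1e10)        # (S,) distances between samples
    alpha = 1.0 - torch.exp(-density * deltas)                    # (R, S) per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1,
    )[:, :-1]                                                     # (R, S) transmittance up to each sample
    weights = alpha * trans                                       # (R, S) compositing weights
    depth = (weights * t_vals).sum(dim=1)                         # LiDAR branch: expected ray depth
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)              # Camera branch: rendered color
    return depth, rgb


# Usage: 1024 rays with 64 samples each; random inputs stand in for features
# sampled from the shared BEV volume.
density = torch.rand(1024, 64)
color = torch.rand(1024, 64, 3)
t_vals = torch.linspace(0.5, 50.0, 64)
depth, rgb = render_rays(density, color, t_vals)
print(depth.shape, rgb.shape)  # torch.Size([1024]) torch.Size([1024, 3])
```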
Problem

Research questions and friction points this paper is trying to address.

How to generate aligned multimodal sensor data when collecting diverse, corner-case driving data in the real world is costly and inefficient
How to unify LiDAR and multi-view camera features in a shared BEV space so that generated modalities stay geometrically consistent
How to make diffusion-based sensor generation controllable while preserving cross-modal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified BEV space aligns multimodal sensor features
UAE method jointly decodes LiDAR and camera data
Diffusion Transformer with ControlNet enables controllable generation (see the sketch below)
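As a rough illustration of the DiT-plus-ControlNet design, the sketch below attaches a control branch to a simplified transformer block: a trainable copy processes the condition tokens (e.g., a BEV layout) and injects them through a zero-initialized projection, so the control path starts as a no-op and the base model's behavior is preserved at initialization. The block internals are simplified assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SimpleDiTBlock(nn.Module):
    """Simplified transformer block (the real DiT adds adaptive LayerNorm, timestep conditioning, etc.)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class ControlledDiTBlock(nn.Module):
    """Main block plus a ControlNet-style branch for the condition tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.main = SimpleDiTBlock(dim)
        self.control = SimpleDiTBlock(dim)    # trainable copy that processes the condition
        self.zero_proj = nn.Linear(dim, dim)  # zero-initialized so control starts as a no-op
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, tokens, cond_tokens):
        control = self.zero_proj(self.control(cond_tokens))
        return self.main(tokens + control)


# Usage: 256 BEV latent tokens of width 64, with condition tokens of the same shape.
block = ControlledDiTBlock(dim=64)
out = block(torch.randn(2, 256, 64), torch.randn(2, 256, 64))
print(out.shape)  # torch.Size([2, 256, 64])
```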