AI Summary
Existing autonomous driving scene generation models suffer from narrow modality coverage and weak controllability, hindering comprehensive system-level evaluation. To address this, we propose an end-to-end multimodal driving scene generation framework that, for the first time, explicitly incorporates high-definition (HD) maps as a dedicated modality and introduces an Action-aware Map Alignment (AMA) mechanism to jointly model images, LiDAR point clouds, agent trajectories, and HD maps, enabling controllable long-sequence generation (≥ 5 s). Methodologically, we adopt a two-stage autoregressive architecture (TAR + OAR) to separately capture temporal dynamics and cross-modal spatial consistency. Leveraging modality-specific tokenization and action-driven geometric transformations, our approach enforces strong physical constraints. Evaluated on benchmarks including nuScenes, our method significantly improves inter-modal consistency and physical plausibility while supporting fine-grained scene editing and robustness assessment.
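The action-driven geometric transformation behind AMA can be illustrated with a minimal sketch. The paper does not publish this code; the function name `action_aware_map_alignment`, the 2D point representation, and the `(dx, dy, dyaw)` action parameterization are assumptions here, showing only the rigid-body warp that re-expresses HD-map geometry in the ego frame after an ego action:

```python
import numpy as np

def action_aware_map_alignment(map_points, action):
    """Warp HD-map polyline points from the previous ego frame into the
    current ego frame, given the ego action (illustrative sketch only).

    map_points : (N, 2) array of (x, y) points in the previous ego frame
    action     : (dx, dy, dyaw) ego translation and heading change
    """
    dx, dy, dyaw = action
    c, s = np.cos(dyaw), np.sin(dyaw)
    # Rotation by -dyaw takes previous-frame coordinates into the new frame
    R = np.array([[c, s],
                  [-s, c]])
    # Translate points relative to the new ego position, then rotate
    return (map_points - np.array([dx, dy])) @ R.T
```

For example, a lane point 10 m ahead of the ego maps to the origin after the ego drives 10 m forward, which is the kind of map/ego-action coherence the AMA module is meant to enforce.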
Abstract
Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods capture only a limited range of modalities, restricting their ability to generate controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies an ego-action-based transformation to the map representation. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
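The two-stage generation loop described above can be sketched structurally. This is not the paper's implementation: the modality order, the stand-in predictors `tar_predict` and `oar_predict`, and the token shapes are all hypothetical placeholders, used only to show how the temporal (TAR) stage conditions on each modality's past frames while the ordered (OAR) stage conditions on modalities already fixed within the current scene:

```python
import random

# Assumed fixed intra-scene modality order (illustrative choice)
MODALITY_ORDER = ["map", "image", "lidar", "trajectory"]

def tar_predict(history, modality):
    """Stand-in for the TAR component: proposes tokens for one modality
    from that modality's past frames (dummy random tokens here)."""
    rng = random.Random(len(history))
    return [rng.randrange(1024) for _ in range(4)]

def oar_predict(scene_so_far, proposal):
    """Stand-in for the OAR component: finalizes a modality's tokens
    conditioned on modalities already predicted in this scene. A real
    model would re-decode token by token; this sketch passes through."""
    return proposal

def generate(num_frames):
    history = {m: [] for m in MODALITY_ORDER}
    scenes = []
    for _ in range(num_frames):
        scene = {}
        for m in MODALITY_ORDER:                      # fixed order per scene
            proposal = tar_predict(history[m], m)     # temporal stage
            scene[m] = oar_predict(scene, proposal)   # cross-modal stage
        for m in MODALITY_ORDER:
            history[m].append(scene[m])
        scenes.append(scene)
    return scenes
```

Factoring generation this way keeps the attention cost manageable: TAR only attends across time within one modality, and OAR only attends across modalities within one scene, rather than one model attending over every token of every modality at every timestep.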