Generating Multimodal Driving Scenes via Next-Scene Prediction

📅 2025-03-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing autonomous driving scene generation models suffer from narrow modality coverage and weak controllability, hindering comprehensive system-level evaluation. To address this, we propose an end-to-end multimodal driving scene generation framework that, for the first time, explicitly incorporates high-definition (HD) maps as a dedicated modality and introduces an Action-aware Map Alignment (AMA) mechanism to jointly model images, LiDAR point clouds, agent trajectories, and HD maps, enabling controllable long-sequence generation (≥5 s). Methodologically, we adopt a two-stage autoregressive architecture (TAR + OAR) to separately capture temporal dynamics and cross-modal spatial consistency. Leveraging modality-specific tokenization and action-driven geometric transformations, our approach enforces strong physical constraints. Evaluated on benchmarks including nuScenes, our method significantly improves inter-modal consistency and physical plausibility, while supporting fine-grained scene editing and robustness assessment.
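The summary describes the AMA mechanism only at a high level. As a rough illustration, the sketch below shows one plausible action-driven geometric transformation: re-expressing HD-map points in the ego frame after a single ego action. The function name, the (dx, dy, dyaw) action format, and the 2D point representation are assumptions for illustration, not the paper's interface.

```python
import numpy as np

def ama_transform(map_points: np.ndarray, ego_action: tuple) -> np.ndarray:
    """Re-express HD-map points in the ego frame after one ego action.

    map_points: (N, 2) map vertices in the current ego frame (metres).
    ego_action: (dx, dy, dyaw) planar displacement and heading change
                of the ego vehicle over one scene step (hypothetical format).
    """
    dx, dy, dyaw = ego_action
    c, s = np.cos(dyaw), np.sin(dyaw)
    rot = np.array([[c, -s], [s, c]])  # R(dyaw)
    # Static map geometry moves opposite to the ego: subtract the
    # translation, then rotate by -dyaw (row-vector form of R^T @ (p - t)).
    return (map_points - np.array([dx, dy])) @ rot

# A straight lane boundary after the ego drives 2 m forward, turning 0.1 rad left.
lane = np.array([[5.0, 1.5], [10.0, 1.5], [15.0, 1.5]])
print(ama_transform(lane, (2.0, 0.0, 0.1)))
```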

๐Ÿ“ Abstract
Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies an ego-action-based transformation to the map. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
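To make the TAR/OAR split concrete, here is a minimal structural sketch of the generation loop, treating both components as opaque callables over per-modality token sequences. The modality order, the stub predictors, and all signatures are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List

Tokens = List[int]
Scene = Dict[str, Tokens]
MODALITY_ORDER = ["map", "image", "lidar", "trajectory"]  # assumed fixed OAR order

def generate_scenes(
    history: List[Scene],
    temporal_ar: Callable[[str, List[Tokens]], Tokens],
    ordered_ar: Callable[[Scene], Scene],
    num_scenes: int,
) -> List[Scene]:
    scenes = list(history)
    for _ in range(num_scenes):
        # Stage 1 (TAR): each modality drafts its next-frame tokens from
        # its own temporal context, independently of the other modalities.
        draft = {m: temporal_ar(m, [s[m] for s in scenes]) for m in MODALITY_ORDER}
        # Stage 2 (OAR): tokens are finalized in a fixed modality order, so
        # each modality can condition on those already emitted in this scene.
        scenes.append(ordered_ar(draft))
    return scenes

# Toy stand-ins so the loop runs end to end.
def toy_tar(modality: str, context: List[Tokens]) -> Tokens:
    return [t + 1 for t in context[-1]]           # "predict" by shifting tokens

def toy_oar(draft: Scene) -> Scene:
    return {m: draft[m] for m in MODALITY_ORDER}  # identity refinement

seed = [{m: [0, 0] for m in MODALITY_ORDER}]
print(generate_scenes(seed, toy_tar, toy_oar, num_scenes=3)[-1])
```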
Problem

Research questions and friction points this paper is trying to address.

How to generate diverse, controllable multimodal driving scenes for comprehensive AD evaluation.
How to incorporate the map modality so that generated scenes become more controllable.
How to keep modalities coherent and consistent across extended driving sequences.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal framework integrating four tokenized data modalities (see the tokenization sketch after this list).
Two-stage autoregressive approach (TAR + OAR) that manages computational demands.
Action-aware Map Alignment (AMA) module that keeps the map and ego-action modalities coherent.
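The bullets above compress the mechanism; as one concrete reading of "tokenized modalities", the sketch below uses a VQ-style nearest-neighbour codebook lookup, a common tokenization choice assumed here for illustration (the paper's actual tokenizers may differ).

```python
import numpy as np

def tokenize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map continuous feature vectors (N, D) to discrete token ids (N,)."""
    # Squared distances between every feature and every codebook entry.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 8))       # 512-entry codebook, 8-dim codes
lidar_feats = rng.normal(size=(4, 8))      # e.g. pooled point-cloud features
print(tokenize(lidar_feats, codebook))     # -> 4 discrete token ids
```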
Authors

Yanhao Wu · School of Software Engineering, XJTU; Horizon Robotics
Haoyang Zhang · Ph.D. student of Computer Science, University of Illinois Urbana-Champaign · Computer Architecture, System Software
Tianwei Lin · Zhejiang University · MLLMs
Lichao Huang · Senior Engineer, Horizon Robotics Inc · Computer Vision, Machine Learning
Shujie Luo · Horizon Robotics
Rui Wu · Horizon Robotics
Congpei Qiu · School of Software Engineering, XJTU
Wei Ke · Xi'an Jiaotong University · Computer Vision and Deep Learning
Tong Zhang · School of Computer and Communication Sciences, EPFL; University of Chinese Academy of Sciences