AI Summary
Existing autonomous driving scene generation models suffer from narrow modality coverage and weak controllability, hindering comprehensive system-level evaluation. To address this, we propose an end-to-end multimodal driving scene generation framework that, for the first time, explicitly incorporates high-definition (HD) maps as a dedicated modality and introduces an Action-aware Map Alignment (AMA) mechanism to jointly model images, LiDAR point clouds, agent trajectories, and HD maps, enabling controllable long-sequence generation (≥ 5 s). Methodologically, we adopt a two-stage autoregressive architecture (TAR + OAR) to separately capture temporal dynamics and cross-modal spatial consistency. Leveraging modality-specific tokenization and action-driven geometric transformations, our approach enforces strong physical constraints. Evaluated on benchmarks including nuScenes, our method significantly improves inter-modal consistency and physical plausibility while supporting fine-grained scene editing and robustness assessment.
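The action-driven geometric transformation behind AMA can be illustrated with a minimal sketch. The paper does not publish this code; the function name `action_aware_map_alignment`, the 2D point representation, and the `(dx, dy, dyaw)` action parameterization are assumptions here, showing only the rigid-body warp that re-expresses HD-map geometry in the ego frame after an ego action:

```python
import numpy as np

def action_aware_map_alignment(map_points, action):
    """Warp HD-map polyline points from the previous ego frame into the
    current ego frame, given the ego action (illustrative sketch only).

    map_points : (N, 2) array of (x, y) points in the previous ego frame
    action     : (dx, dy, dyaw) ego translation and heading change
    """
    dx, dy, dyaw = action
    c, s = np.cos(dyaw), np.sin(dyaw)
    # Rotation by -dyaw takes previous-frame coordinates into the new frame
    R = np.array([[c, s],
                  [-s, c]])
    # Translate points relative to the new ego position, then rotate
    return (map_points - np.array([dx, dy])) @ R.T
```

For example, a lane point 10 m ahead of the ego maps to the origin after the ego drives 10 m forward, which is the kind of map/ego-action coherence the AMA module is meant to enforce.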
Abstract
Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods capture only a limited range of modalities, restricting their ability to generate controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies an ego-action-based transformation to the map representation. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
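The two-stage generation loop described above can be sketched structurally. This is not the paper's implementation: the modality order, the stand-in predictors `tar_predict` and `oar_predict`, and the token shapes are all hypothetical placeholders, used only to show how the temporal (TAR) stage conditions on each modality's past frames while the ordered (OAR) stage conditions on modalities already fixed within the current scene:

```python
import random

# Assumed fixed intra-scene modality order (illustrative choice)
MODALITY_ORDER = ["map", "image", "lidar", "trajectory"]

def tar_predict(history, modality):
    """Stand-in for the TAR component: proposes tokens for one modality
    from that modality's past frames (dummy random tokens here)."""
    rng = random.Random(len(history))
    return [rng.randrange(1024) for _ in range(4)]

def oar_predict(scene_so_far, proposal):
    """Stand-in for the OAR component: finalizes a modality's tokens
    conditioned on modalities already predicted in this scene. A real
    model would re-decode token by token; this sketch passes through."""
    return proposal

def generate(num_frames):
    history = {m: [] for m in MODALITY_ORDER}
    scenes = []
    for _ in range(num_frames):
        scene = {}
        for m in MODALITY_ORDER:                      # fixed order per scene
            proposal = tar_predict(history[m], m)     # temporal stage
            scene[m] = oar_predict(scene, proposal)   # cross-modal stage
        for m in MODALITY_ORDER:
            history[m].append(scene[m])
        scenes.append(scene)
    return scenes
```

Factoring generation this way keeps the attention cost manageable: TAR only attends across time within one modality, and OAR only attends across modalities within one scene, rather than one model attending over every token of every modality at every timestep.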