MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autonomous driving video generation methods are limited to unimodal RGB output, which prevents joint generation of complementary modalities such as depth and semantic maps; working around this with separate per-modality models complicates deployment and leaves cross-modal complementarity unexploited. To address these limitations, we propose the first unified multimodal, multiview video generation framework tailored for urban driving scenes. Built on a diffusion Transformer architecture, our method introduces a dual-path design that jointly models modality-shared and modality-specific representations, integrating multimodal conditional encoding, cross-modal feature interaction, and multiview consistency modeling. Evaluated on the nuScenes dataset, our approach achieves, for the first time, synchronized high-fidelity generation of RGB, depth, and semantic maps. It significantly improves structural controllability and inter-view consistency, outperforming state-of-the-art methods across key metrics.

📝 Abstract
Video generation has recently shown superiority in urban scene synthesis for autonomous driving. Existing video generation approaches for autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues across modalities. To address this problem, in this work we propose a novel multi-modal multi-view video generation approach for autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. We then leverage diverse conditioning inputs to encode controllable scene structure and content cues into the unified diffusion model for multi-modal multi-view video generation. In this way, our approach can generate multi-modal multi-view driving scene videos within a single unified framework. Our experiments on the challenging real-world autonomous driving dataset nuScenes show that our approach generates multi-modal multi-view urban scene videos with high fidelity and controllability, surpassing state-of-the-art methods.
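The dual-path idea described in the abstract (a modal-shared component applied to every modality, plus modal-specific components per modality, with the two fused) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the token shapes, the tanh transforms, and all variable names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # token feature dimension (illustrative)

# Hypothetical token sequences for the three modalities (seq_len = 4 each)
modal_tokens = {m: rng.normal(size=(4, D)) for m in ("rgb", "depth", "semantic")}

# Modality-shared weights (one transform applied to all modalities)
w_shared = rng.normal(size=(D, D)) * 0.1
# Modality-specific weights (a separate transform per modality)
w_specific = {m: rng.normal(size=(D, D)) * 0.1 for m in modal_tokens}

def dual_path_step(tokens, w_shared, w_specific):
    """Fuse a modality-shared transform with a modality-specific one,
    mirroring the modal-shared / modal-specific split in the abstract."""
    out = {}
    for name, tok in tokens.items():
        shared = np.tanh(tok @ w_shared)            # features common across modalities
        specific = np.tanh(tok @ w_specific[name])  # features unique to this modality
        out[name] = shared + specific               # simple additive fusion (assumed)
    return out

fused = dual_path_step(modal_tokens, w_shared, w_specific)
```

In the paper this fusion would sit inside diffusion transformer blocks and be combined with conditioning inputs and multi-view consistency modeling; the sketch only shows the shared/specific weight-sharing pattern.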
Problem

Research questions and friction points this paper is trying to address.

Generates multi-modal urban driving videos
Unifies depth, semantic and RGB generation
Enables controllable multi-view scene synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion transformer model for generation
Leverages diverse conditioning inputs for controllability
Generates multi-modal multi-view videos in unified framework