DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

📅 2024-09-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address lane structure distortion, weak foreground object modeling, and insufficient motion modeling in long-sequence, multi-view driving video generation for autonomous driving simulation, this paper proposes an autoregressive diffusion-based video generation framework. The method introduces three key innovations: (1) a motion-frame-driven temporal attention mechanism that explicitly captures dynamic spatiotemporal dependencies; (2) a perspective-space guidance module jointly optimized with object-level 3D position encoding to enforce local geometric consistency and motion-aware object association; and (3) efficient training on short sequences while enabling controllable synthesis of ultra-long videos (>200 frames). Tightly integrated with the DriveArena simulator, the approach achieves significant improvements over state-of-the-art baselines on the 16-frame benchmark and supports high-fidelity, long-horizon, interpretable evaluation of vision-based agents in both open-loop and closed-loop settings.

📝 Abstract
Recent advances in diffusion models have improved controllable streetscape generation and supported downstream perception and planning tasks. However, challenges remain in accurately modeling driving scenes and generating long videos. To alleviate these issues, we propose DreamForge, an advanced diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation. To enhance lane and foreground generation, we introduce perspective guidance and integrate object-wise position encoding to incorporate local 3D correlation and improve foreground object modeling. We also propose motion-aware temporal attention to capture motion cues and appearance changes in videos. By leveraging motion frames and an autoregressive generation paradigm, we can autoregressively generate long videos (over 200 frames) using a model trained on short sequences, achieving superior quality compared to the baseline in 16-frame video evaluations. Finally, we integrate our method with the realistic simulator DriveArena to provide more reliable open-loop and closed-loop evaluations for vision-based driving agents. Project Page: https://pjlab-adg.github.io/DriveArena/dreamforge.
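The abstract's autoregressive paradigm (condition each new chunk on the last few generated "motion frames," so a model trained on short sequences can roll out 200+ frames) can be sketched as follows. This is a minimal illustration, not DreamForge's actual API: `generate_chunk` stands in for the diffusion model, and all names and the chunk/motion-frame counts are assumptions.

```python
def generate_chunk(motion_frames, chunk_len=7):
    # Placeholder for the diffusion denoiser: here each new frame is just
    # the successor of the last motion frame, to make the loop runnable.
    last = motion_frames[-1]
    return [last + i + 1 for i in range(chunk_len)]

def generate_long_video(total_frames=200, num_motion_frames=2, chunk_len=7):
    video = [0, 1]  # bootstrap frames (e.g. from an initial generation pass)
    while len(video) < total_frames:
        motion = video[-num_motion_frames:]          # condition on recent history
        video.extend(generate_chunk(motion, chunk_len))
    return video[:total_frames]

frames = generate_long_video()
print(len(frames))  # 200
```

The point of the loop is that the model only ever sees short windows (motion frames plus one chunk), matching its short-sequence training, while the rollout length is unbounded.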
Problem

Research questions and friction points this paper is trying to address.

Accurate modeling of driving scenes for video generation
Long-term video generation using short sequence training
Integration with simulators for reliable driving agent evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advanced diffusion-based autoregressive video generation model
Perspective guidance and object-wise position encoding
Motion-aware temporal attention for capturing motion cues
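The motion-aware temporal attention above can be sketched as standard attention in which tokens from the preceding motion frames are prepended to the key/value set, so current frames attend over recent history. This is a single-head sketch with projections omitted; the function and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def motion_aware_temporal_attention(x, motion):
    """x: (T, d) tokens for current frames at one spatial location;
    motion: (M, d) tokens from preceding motion frames.
    Returns (T, d) outputs after attending over motion + current frames."""
    d = x.shape[-1]
    kv = np.concatenate([motion, x], axis=0)         # (M + T, d) keys/values
    scores = x @ kv.T / np.sqrt(d)                   # (T, M + T) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the time axis
    return weights @ kv

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 current-frame tokens
motion = rng.normal(size=(2, 8))   # 2 motion-frame tokens
out = motion_aware_temporal_attention(x, motion)
print(out.shape)  # (4, 8)
```

Because the motion tokens enter only as keys/values, the output length stays tied to the current chunk while motion cues from earlier frames still influence every attention weight.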