🤖 AI Summary
Existing driving video generation models produce photorealistic videos conditioned on 2D layouts (e.g., HD maps, bounding boxes) but struggle with 3D geometric consistency and robust conditional control across multiple views. To address this, we propose a spatially adaptive generative framework with two key innovations: (1) a high-fidelity, controllable 3D scene conditioning mechanism, and (2) a consistency adapter module that jointly integrates 3D-aware generation, spatially adaptive condition encoding, multi-view geometric constraints, and diffusion-model fine-tuning. Our method explicitly models cross-view geometric consistency while preserving strong conditional controllability. Experiments demonstrate significant improvements in both the 3D fidelity and the visual realism of the generated videos. Under joint HD-map and multi-view conditioning, our approach achieves state-of-the-art cross-view consistency, outperforming prior methods in both quantitative and qualitative evaluations.
📝 Abstract
Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce CoGen, a novel spatially adaptive generation framework that leverages advances in 3D generation to improve performance in two key aspects: (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) Additionally, we introduce a consistency adapter module to strengthen the model's robustness under multi-condition control. The results demonstrate that this method excels at preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.
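The abstract does not specify the consistency adapter's architecture, but adapters attached to a pretrained diffusion backbone are commonly built as residual branches whose output projection is zero-initialized, so that fine-tuning starts exactly from the pretrained model's behavior. Below is a minimal NumPy sketch of that generic pattern; the class name `ConsistencyAdapter`, its dimensions, and the two-layer MLP branch are illustrative assumptions, not CoGen's actual design.

```python
import numpy as np

class ConsistencyAdapter:
    """Hypothetical residual adapter: injects an encoded 3D scene condition
    into frozen backbone features. The output projection is zero-initialized
    (a ControlNet-style choice), so the adapter is a no-op before training."""

    def __init__(self, dim, cond_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init for the condition projection.
        self.w_in = rng.standard_normal((cond_dim, dim)) * 0.02
        # Zero init: the residual branch contributes nothing at step 0.
        self.w_out = np.zeros((dim, dim))

    def __call__(self, hidden, cond):
        # hidden: (tokens, dim) features from the frozen diffusion backbone
        # cond:   (tokens, cond_dim) encoded 3D condition (e.g., scene geometry)
        delta = np.maximum(cond @ self.w_in, 0.0) @ self.w_out  # ReLU MLP branch
        return hidden + delta  # residual injection preserves base behavior

# At initialization the adapter leaves backbone features unchanged.
h = np.ones((4, 8))
c = np.ones((4, 6))
out = ConsistencyAdapter(dim=8, cond_dim=6)(h, c)
```

During fine-tuning only `w_in`/`w_out` would be updated, which is one common way to keep a pretrained generator robust while adding a new conditioning signal.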