🤖 AI Summary
Existing driving video generation models produce photorealistic videos conditioned on 2D layouts (e.g., HD maps, bounding boxes) but struggle with 3D geometric consistency and robust conditional control across multiple views. To address this, we propose a spatially adaptive generative framework with two key innovations: (1) a high-fidelity, controllable 3D scene conditioning mechanism, and (2) a consistency adapter module that jointly integrates 3D-aware generation, spatially adaptive condition encoding, multi-view geometric constraints, and diffusion-model fine-tuning. Our method explicitly models cross-view geometric consistency while preserving strong conditional controllability. Experiments demonstrate significant improvements in both the 3D fidelity and the visual realism of the generated videos. Under joint HD-map and multi-view conditioning, our approach achieves state-of-the-art cross-view consistency, outperforming prior methods in both quantitative and qualitative evaluations.
📝 Abstract
Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce CoGen, a novel spatially adaptive generation framework that leverages advances in 3D generation to improve performance in two key aspects: (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) Additionally, we introduce a consistency adapter module to strengthen the model's robustness under multi-condition control. The results demonstrate that this method excels at preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.
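The abstract does not specify the consistency adapter's architecture, but adapters attached to a pretrained diffusion backbone are commonly built as residual branches whose output projection is zero-initialized, so that fine-tuning starts exactly from the pretrained model's behavior. Below is a minimal NumPy sketch of that generic pattern; the class name `ConsistencyAdapter`, its dimensions, and the two-layer MLP branch are illustrative assumptions, not CoGen's actual design.

```python
import numpy as np

class ConsistencyAdapter:
    """Hypothetical residual adapter: injects an encoded 3D scene condition
    into frozen backbone features. The output projection is zero-initialized
    (a ControlNet-style choice), so the adapter is a no-op before training."""

    def __init__(self, dim, cond_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init for the condition projection.
        self.w_in = rng.standard_normal((cond_dim, dim)) * 0.02
        # Zero init: the residual branch contributes nothing at step 0.
        self.w_out = np.zeros((dim, dim))

    def __call__(self, hidden, cond):
        # hidden: (tokens, dim) features from the frozen diffusion backbone
        # cond:   (tokens, cond_dim) encoded 3D condition (e.g., scene geometry)
        delta = np.maximum(cond @ self.w_in, 0.0) @ self.w_out  # ReLU MLP branch
        return hidden + delta  # residual injection preserves base behavior

# At initialization the adapter leaves backbone features unchanged.
h = np.ones((4, 8))
c = np.ones((4, 6))
out = ConsistencyAdapter(dim=8, cond_dim=6)(h, c)
```

During fine-tuning only `w_in`/`w_out` would be updated, which is one common way to keep a pretrained generator robust while adding a new conditioning signal.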