CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing driving video generation models produce photorealistic videos conditioned on 2D layouts (e.g., HD maps, bounding boxes) but struggle with 3D geometric consistency and robust conditional control across multiple views. To address this, we propose a spatially adaptive generative framework featuring two key innovations: (1) a high-fidelity, controllable 3D scene conditioning mechanism, and (2) a consistency adapter module that jointly integrates 3D-aware generation, spatially adaptive condition encoding, multi-view geometric constraints, and diffusion model fine-tuning. Our method explicitly models cross-view geometric consistency while preserving strong conditional controllability. Experiments demonstrate significant improvements in both 3D fidelity and visual realism of generated videos. Under joint HD map and multi-view conditioning, our approach achieves state-of-the-art cross-view consistency—outperforming prior methods in quantitative and qualitative evaluations.

📝 Abstract
Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce a novel spatially adaptive generation framework, CoGen, which leverages advances in 3D generation to improve performance in two key aspects: (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) Additionally, we introduce a consistency adapter module to strengthen the robustness of the model to multi-condition control. The results demonstrate that this method excels in preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.
Problem

Research questions and friction points this paper is trying to address.

Achieving 3D consistent multi-view driving videos
Enhancing spatial consistency with fine-grained 3D conditions
Improving robustness for multi-condition control in generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates high-quality, controllable 3D scene conditions
Employs a spatially adaptive generation framework
Introduces a consistency adapter module for robust multi-condition control
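The paper does not publish the adapter's internals, but a common design for grafting a new conditioning signal (here, encoded 3D scene features) onto a pretrained diffusion backbone is a zero-initialized residual branch: the adapter's projection starts at zero, so fine-tuning begins from the pretrained model's exact behavior. The sketch below illustrates that general pattern in plain Python; all names and dimensions are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a zero-initialized residual condition adapter,
# a common pattern for adding new conditioning to a pretrained model
# without disturbing it at initialization. Illustrative only; this is
# not CoGen's published implementation.

def linear(x, weight, bias):
    """Minimal dense layer: y = W x + b (pure Python, no frameworks)."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]

class ConditionAdapter:
    """Residual adapter: out = feature + proj(condition).

    The projection is zero-initialized, so before any fine-tuning the
    adapter is an identity map and the backbone's output is unchanged.
    """

    def __init__(self, cond_dim, feat_dim):
        # Zero init: the adapter contributes nothing until trained.
        self.weight = [[0.0] * cond_dim for _ in range(feat_dim)]
        self.bias = [0.0] * feat_dim

    def __call__(self, feature, condition):
        delta = linear(condition, self.weight, self.bias)
        return [f + d for f, d in zip(feature, delta)]

adapter = ConditionAdapter(cond_dim=3, feat_dim=2)
feature = [0.5, -1.0]        # backbone feature at one spatial location
condition = [1.0, 0.0, 2.0]  # encoded 3D condition at the same location
print(adapter(feature, condition))  # zero-init: feature passes through unchanged
```

During fine-tuning the projection weights move away from zero, letting the 3D condition steer the backbone's features while the pretrained prior remains the starting point; the same idea extends to multiple conditions by giving each its own adapter branch.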
Authors
Yishen Ji (Nanjing University)
Ziyue Zhu (Xiaomi EV)
Zhenxin Zhu (Xiaomi AD)
Kaixin Xiong (Xiaomi EV)
Ming Lu (Peking University)
Zhiqi Li (Nanjing University)
Lijun Zhou (Xiaomi Corporation)
Haiyang Sun (Xiaomi EV)
Bing Wang (Xiaomi EV)
Tong Lu (Nanjing University)