CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving

📅 2025-10-09
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
To address the need for high-fidelity, multi-view, long-horizon video generation and 4D scene reconstruction in autonomous driving, this paper proposes CVD-STORM, a cross-view video diffusion model built on a spatial-temporal reconstruction VAE: it couples a video diffusion prior with a variational autoencoder, incorporates a cross-view generation mechanism, and employs a Gaussian splatting decoder for geometrically consistent dynamic scene reconstruction. The key contribution is an auxiliary 4D reconstruction task that explicitly enforces 3D structural and temporal modeling in the latent space, enabling joint optimization of generative quality and depth/motion estimation accuracy. The method achieves substantial improvements over state-of-the-art approaches on both FID and FVD metrics, and the generated videos exhibit high visual fidelity, accurate depth estimation, and physically plausible motion, supporting autonomous driving simulation, environmental understanding, and future state prediction.
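
To make the first stage concrete, the block below is a hedged, toy PyTorch sketch of fine-tuning a spatio-temporal VAE with an auxiliary 4D reconstruction branch. All names and hyperparameters (STORMVAE, gs_decoder, lambda_4d, lambda_kl) are illustrative assumptions rather than the paper's actual code, and the real model renders the predicted Gaussians with a splatting rasterizer instead of supervising a depth channel directly as done here for brevity.

```python
# Toy sketch (assumed names): VAE fine-tuning with an auxiliary geometry head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STORMVAE(nn.Module):
    def __init__(self, latent_dim=8):
        super().__init__()
        # Spatio-temporal encoder: video (B, 3, T, H, W) -> latent statistics.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(32, 2 * latent_dim, 3, stride=2, padding=1),
        )
        # RGB decoder: latents -> reconstructed video.
        self.rgb_decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_dim, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(32, 3, 4, stride=2, padding=1),
        )
        # Auxiliary Gaussian-splatting-style head: latents -> per-pixel Gaussian
        # parameters (here: depth 1 + scale 3 + rotation 4 + opacity 1 + rgb 3 = 12).
        self.gs_decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_dim, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(32, 12, 4, stride=2, padding=1),
        )

    def forward(self, video):
        mu, logvar = self.encoder(video).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.rgb_decoder(z), self.gs_decoder(z), mu, logvar

def finetune_step(model, video, depth_gt, lambda_4d=0.5, lambda_kl=1e-6):
    recon, gauss, mu, logvar = model(video)
    pred_depth = gauss[:, :1]                    # first channel used as a depth proxy
    rec_loss = F.l1_loss(recon, video)           # RGB reconstruction term
    geo_loss = F.l1_loss(pred_depth, depth_gt)   # auxiliary 4D/geometry term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_loss + lambda_4d * geo_loss + lambda_kl * kl

model = STORMVAE()
video = torch.randn(1, 3, 8, 64, 64)   # (batch, rgb, frames, H, W)
depth = torch.rand(1, 1, 8, 64, 64)    # pseudo ground-truth depth for the sketch
loss = finetune_step(model, video, depth)
loss.backward()
```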

📝 Abstract
Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.
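
The abstract's second stage, plugging the fine-tuned VAE into the video diffusion process, could look roughly like the sketch below. The denoiser, noise schedule, and view-stacking conditioning are generic DDPM-style placeholders chosen for brevity, not the paper's architecture, and `vae_latents` merely stands in for the encoder output of the frozen STORM VAE.

```python
# Toy sketch (assumed names): cross-view latent video diffusion on frozen VAE latents.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewDenoiser(nn.Module):
    """Toy denoiser over latents shaped (B, V*C, T, H, W).

    Folding the V camera views into the channel dimension lets every 3D
    convolution mix information across views, a crude stand-in for the
    paper's cross-view generation mechanism.
    """
    def __init__(self, views=6, channels=8):
        super().__init__()
        c = views * channels
        self.net = nn.Sequential(
            nn.Conv3d(c + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv3d(64, c, 3, padding=1),
        )

    def forward(self, x, t):
        # Broadcast the normalized timestep as an extra conditioning channel.
        t_map = t.view(-1, 1, 1, 1, 1).expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, t_map], dim=1))

def diffusion_step(denoiser, latents, num_steps=1000):
    b = latents.shape[0]
    t = torch.randint(0, num_steps, (b,))
    # Simple cosine-style signal level; the paper's actual schedule may differ.
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps).view(-1, 1, 1, 1, 1) ** 2
    noise = torch.randn_like(latents)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, t.float() / num_steps)
    return F.mse_loss(pred, noise)  # epsilon-prediction objective

# Placeholder for 6 camera views encoded by the frozen STORM VAE:
# (batch, views * latent_channels, frames, h, w)
vae_latents = torch.randn(1, 6 * 8, 2, 16, 16)
denoiser = CrossViewDenoiser(views=6, channels=8)
loss = diffusion_step(denoiser, vae_latents)
loss.backward()
```
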
Problem

Research questions and friction points this paper is trying to address.

Generating long-horizon, multi-view driving videos with consistent 4D scene reconstruction
Improving video generation quality through a spatial-temporal reconstruction VAE
Reconstructing dynamic scenes to provide geometric information for scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-view video diffusion with spatial-temporal reconstruction VAE
Fine-tuned VAE with auxiliary 4D reconstruction task
Integrated Gaussian Splatting Decoder for dynamic scene reconstruction
Authors

Tianrui Zhang (The Hong Kong University of Science and Technology)
Yichen Liu (SenseTime Research)
Zilin Guo (The Hong Kong University of Science and Technology)
Yuxin Guo (SenseTime Research)
Jingcheng Ni (SenseTime Research)
Chenjing Ding (Unknown affiliation)
Dan Xu (The Hong Kong University of Science and Technology)
Lewei Lu (SenseTime Research)
Zehuan Wu (SenseTime Research)