WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing driving-scene generation methods struggle to ensure 3D consistency and multi-view controllable synthesis at the same time, while reconstruction-based approaches lack generative capability. This paper proposes the first feed-forward, Gaussian-centric 4D generation framework that unifies generative and reconstructive modeling. It employs a 4D-aware latent diffusion model to synthesize spatiotemporally consistent, pixel-aligned Gaussian representations, coupled with an enhanced video diffusion model that refines novel-view renderings. A multi-modal conditioning mechanism enables joint optimization of geometry and appearance. Evaluated on standard benchmarks, the method achieves, for the first time, end-to-end generation of high-fidelity driving videos across multiple trajectories and viewpoints. It significantly improves 3D consistency (23.6% reduction in Chamfer Distance) and visual fidelity (41.2% reduction in FID), offering a practical route to data augmentation and controllable novel-view synthesis for autonomous driving.
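
The summary describes a two-stage, feed-forward pipeline: a latent diffusion model produces features that decode into per-pixel Gaussian parameters, and a video diffusion model refines the views rendered from them. Below is a minimal sketch of that control flow; every interface name (latent_diffusion, GaussianHead, render_gaussians, refiner) is an assumption for illustration, not the paper's published code.

```python
# Minimal sketch of the two-stage pipeline described above. Every interface
# here (latent_diffusion, GaussianHead, render_gaussians, refiner) is a
# hypothetical stand-in for illustration, not WorldSplat's published API.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps decoded 4D-aware features to one Gaussian per pixel."""
    def __init__(self, in_ch: int = 64):
        super().__init__()
        # 3 mean + 3 scale + 4 rotation (quaternion) + 3 color + 1 opacity = 14
        self.proj = nn.Conv2d(in_ch, 14, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) per-frame features -> (B, H*W, 14) parameters
        return self.proj(feats).flatten(2).transpose(1, 2)

def render_gaussians(gaussians, poses, size=(3, 64, 64)):
    # Placeholder: a real implementation would call a differentiable
    # Gaussian-splatting rasterizer; here it just returns blank frames.
    return torch.zeros(len(poses), *size)

def generate_multitrack_video(latent_diffusion, gaussian_head, refiner,
                              conditions, novel_poses):
    """Stage 1: diffuse 4D-aware latents and decode pixel-aligned Gaussians.
    Stage 2: splat the Gaussians along novel trajectories, then refine."""
    feats = latent_diffusion(conditions)               # spatiotemporal features
    gaussians = gaussian_head(feats)                   # pixel-aligned 4D Gaussians
    coarse = render_gaussians(gaussians, novel_poses)  # coarse novel-view frames
    return refiner(coarse)                             # video-diffusion cleanup
```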

📝 Abstract
Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) we introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner; (ii) subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos.
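
The abstract's key primitive, pixel-aligned 4D Gaussians, is easiest to picture as one Gaussian per pixel whose mean comes from unprojecting that pixel with a predicted depth. The snippet below sketches that standard construction; the intrinsics `K` and the unit depth map are illustrative assumptions, and the paper's exact parameterization may differ.

```python
# Sketch of placing one Gaussian mean per pixel by back-projecting each
# pixel with a predicted depth. K and the depth values are illustrative.
import torch

def unproject_pixels(depth, K):
    """depth: (H, W) predicted per-pixel depth; K: (3, 3) camera intrinsics.
    Returns (H*W, 3) Gaussian means in camera coordinates."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
    rays = pix @ torch.linalg.inv(K).T   # back-project pixels to unit-depth rays
    return rays * depth.reshape(-1, 1)   # scale each ray by its predicted depth

# Example: a 4x4 image at unit depth under a simple pinhole camera.
K = torch.tensor([[2., 0., 2.], [0., 2., 2.], [0., 0., 1.]])
means = unproject_pixels(torch.ones(4, 4), K)  # (16, 3) Gaussian centers
```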
Problem

Research questions and friction points this paper is trying to address.

Generating 4D driving scenes with 3D consistency
Overcoming limitations in novel-view synthesis for autonomous driving
Combining scene generation and reconstruction capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward 4D Gaussian generation framework
Multi-modal latent diffusion model integration
Enhanced video diffusion model for refinement (see the sketch after this list)
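
For the refinement step listed above, one common recipe is SDEdit-style partial re-noising: push the coarse render part-way into the diffusion noise schedule, then denoise it back so the model repaints artifacts while preserving the rendered geometry. Whether WorldSplat uses this exact scheme is an assumption; the sketch only illustrates the general technique, and `denoiser(x, t)` is a hypothetical noise-prediction interface.

```python
# SDEdit-style refinement sketch (an assumption, not WorldSplat's exact
# scheme): re-noise the coarse render to timestep t_start, then run a
# deterministic DDIM reverse loop back to t = 0.
import torch

def refine_render(render, denoiser, alphas_cumprod, t_start=400):
    """render: (T, C, H, W) coarse frames scaled to [-1, 1].
    denoiser(x, t) -> predicted noise eps (hypothetical interface)."""
    a = alphas_cumprod[t_start]
    x = a.sqrt() * render + (1 - a).sqrt() * torch.randn_like(render)
    for t in range(t_start, -1, -1):
        eps = denoiser(x, t)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[max(t - 1, 0)]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean frames
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM step (eta = 0)
    return x
```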
👥 Authors
Ziyue Zhu, Nankai University
Zhanqian Wu, Xiaomi EV
Zhenxin Zhu, Xiaomi AD
Lijun Zhou, Xiaomi Corporation
Haiyang Sun, Xiaomi EV
Bing Wan, Xiaomi EV
Kun Ma, University of Jinan
Guang Chen, Xiaomi EV
Hangjun Ye, Xiaomi EV
Jin Xie, Nanjing University, Suzhou
Jian Yang, Nankai University