🤖 AI Summary
Existing driving-scene generation methods struggle to simultaneously ensure 3D consistency and multi-view controllable synthesis, while reconstruction-based approaches lack generative capability. This paper proposes the first feed-forward 4D Gaussian generation framework that unifies generative and reconstructive modeling. It employs a 4D-aware latent diffusion model to synthesize spatiotemporally consistent, pixel-aligned Gaussian representations, coupled with an enhanced video diffusion model to refine novel-view renderings. A multi-modal conditioning mechanism enables joint optimization of geometry and appearance. Evaluated on standard benchmarks, the method achieves, for the first time, end-to-end generation of high-quality, high-fidelity driving videos across multiple trajectories and viewpoints. It significantly improves 3D consistency (23.6% reduction in Chamfer Distance) and visual fidelity (41.2% reduction in FID), establishing a new paradigm for autonomous-driving data augmentation and controllable novel-view synthesis.
📝 Abstract
Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose **WorldSplat**, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that **WorldSplat** effectively generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos.
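To make the two-stage structure concrete, here is a minimal Python sketch of a pipeline of this shape: a feed-forward generator predicts 4D Gaussians once, then each novel trajectory is rendered and refined. All names here (`gaussian_generator`, `render_gaussians`, `video_refiner`, the tensor shapes) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Sketch of a two-stage "generate Gaussians, then render + refine" pipeline,
# mirroring steps (i) and (ii) in the abstract. All component names and
# tensor shapes are assumptions made for illustration.
from typing import Callable, Dict, List

import torch


def generate_multi_track_videos(
    gaussian_generator: Callable[[Dict], torch.Tensor],  # (i) 4D-aware latent diffusion
    render_gaussians: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],  # splatting renderer
    video_refiner: Callable[[torch.Tensor], torch.Tensor],  # (ii) video diffusion refinement
    conditions: Dict,
    trajectories: List[torch.Tensor],
) -> List[torch.Tensor]:
    """Feed-forward 4D Gaussian generation, then per-trajectory rendering + refinement."""
    # (i) Predict pixel-aligned 4D Gaussians in a single feed-forward pass
    #     from multi-modal conditions (e.g., layout, text, reference frames).
    gaussians = gaussian_generator(conditions)  # e.g., [T, N, gaussian_dim]

    videos = []
    for camera_poses in trajectories:
        # Render the shared 4D Gaussians along a novel camera trajectory,
        # e.g., with a differentiable Gaussian-splatting rasterizer.
        coarse_video = render_gaussians(gaussians, camera_poses)  # [T, 3, H, W]
        # (ii) Refine the coarse rendering to restore high-frequency detail
        #      while keeping the geometry fixed by the Gaussian scene.
        videos.append(video_refiner(coarse_video))
    return videos
```

The key design point this sketch captures is that the 4D Gaussians are generated once and shared across all trajectories, which is what gives the rendered multi-track videos their spatial and temporal consistency; the video diffusion model only refines appearance per rendered view.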