🤖 AI Summary
Existing driving-scene generation methods struggle to simultaneously ensure 3D consistency and multi-view controllable synthesis, while reconstruction-based approaches lack generative capability. This paper proposes the first feed-forward 4D Gaussian generation framework that unifies generative and reconstructive modeling. It employs a 4D-aware latent diffusion model to synthesize spatiotemporally consistent, pixel-aligned Gaussian representations, coupled with an enhanced video diffusion model to refine novel-view renderings. A multi-modal conditioning mechanism enables joint optimization of geometry and appearance. Evaluated on standard benchmarks, the method achieves, for the first time, end-to-end generation of high-quality, high-fidelity driving videos across multiple trajectories and viewpoints. It significantly improves 3D consistency (23.6% reduction in Chamfer Distance) and visual fidelity (41.2% reduction in FID), establishing a new paradigm for autonomous-driving data augmentation and controllable novel-view synthesis.
📝 Abstract
Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose **WorldSplat**, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that **WorldSplat** effectively generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos.
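To make the two-stage structure concrete, here is a minimal Python sketch of a pipeline of this shape: a feed-forward generator predicts 4D Gaussians once, then each novel trajectory is rendered and refined. All names here (`gaussian_generator`, `render_gaussians`, `video_refiner`, the tensor shapes) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Sketch of a two-stage "generate Gaussians, then render + refine" pipeline,
# mirroring steps (i) and (ii) in the abstract. All component names and
# tensor shapes are assumptions made for illustration.
from typing import Callable, Dict, List

import torch


def generate_multi_track_videos(
    gaussian_generator: Callable[[Dict], torch.Tensor],  # (i) 4D-aware latent diffusion
    render_gaussians: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],  # splatting renderer
    video_refiner: Callable[[torch.Tensor], torch.Tensor],  # (ii) video diffusion refinement
    conditions: Dict,
    trajectories: List[torch.Tensor],
) -> List[torch.Tensor]:
    """Feed-forward 4D Gaussian generation, then per-trajectory rendering + refinement."""
    # (i) Predict pixel-aligned 4D Gaussians in a single feed-forward pass
    #     from multi-modal conditions (e.g., layout, text, reference frames).
    gaussians = gaussian_generator(conditions)  # e.g., [T, N, gaussian_dim]

    videos = []
    for camera_poses in trajectories:
        # Render the shared 4D Gaussians along a novel camera trajectory,
        # e.g., with a differentiable Gaussian-splatting rasterizer.
        coarse_video = render_gaussians(gaussians, camera_poses)  # [T, 3, H, W]
        # (ii) Refine the coarse rendering to restore high-frequency detail
        #      while keeping the geometry fixed by the Gaussian scene.
        videos.append(video_refiner(coarse_video))
    return videos
```

The key design point this sketch captures is that the 4D Gaussians are generated once and shared across all trajectories, which is what gives the rendered multi-track videos their spatial and temporal consistency; the video diffusion model only refines appearance per rendered view.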