🤖 AI Summary
This work addresses the challenge of achieving efficient, high-quality, and temporally consistent novel-view synthesis in dynamic urban scenes. The authors propose a feed-forward 4D scene synthesis framework that decomposes the scene into three branches modeling close-range static structures, dynamic objects, and distant regions, respectively. By integrating voxel-level 3D Gaussian representations with object-centric dynamic modeling—which the authors present as a first in the field—the method overcomes the temporal inconsistency inherent in conventional per-pixel Gaussian approaches. The framework combines 3D feature-volume-based static geometry prediction, canonical-space dynamic entity modeling, motion-aware rendering, and semantics-enhanced image synthesis. Experiments on KITTI-360, KITTI, Waymo, and PandaSet demonstrate significant improvements over both feed-forward and per-scene optimization baselines, achieving state-of-the-art efficiency, reconstruction accuracy, and temporal coherence for 4D urban scene reconstruction.
📝 Abstract
Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent 3D Gaussian geometry over multiple frames directly from a 3D feature volume, complemented by a semantically enhanced image-based rendering module that predicts their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
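The three-branch decomposition above can be illustrated with a minimal routing sketch. This is purely a toy illustration of the idea of partitioning scene content by distance and dynamics: the depth cutoff, the dynamic mask, and the function names are assumptions for exposition, not values or APIs from the paper.

```python
import numpy as np

# Assumed cutoff between close-range and far-field content (illustrative only;
# the paper does not specify this value).
FAR_FIELD_DEPTH = 50.0

def route_points(depths, is_dynamic):
    """Toy routing of scene points to EvolSplat4D-style branches.

    depths     : (N,) per-point depth from the ego vehicle
    is_dynamic : (N,) boolean mask marking points on moving actors
    returns    : (N,) branch labels: 'static' (volume-based Gaussians),
                 'dynamic' (object-centric canonical space), or
                 'far_field' (per-pixel Gaussians)
    """
    labels = np.full(depths.shape, "static", dtype=object)
    labels[depths >= FAR_FIELD_DEPTH] = "far_field"
    # Dynamic actors take priority over the distance-based split.
    labels[is_dynamic] = "dynamic"
    return labels

depths = np.array([10.0, 60.0, 30.0, 80.0])
dyn = np.array([False, False, True, False])
print(route_points(depths, dyn).tolist())
# → ['static', 'far_field', 'dynamic', 'far_field']
```

The point of the sketch is only that each region type is served by the representation best suited to it, with the per-branch predictions merged at render time.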