🤖 AI Summary
This work addresses the challenge of achieving efficient, high-quality, and temporally consistent novel-view synthesis in dynamic urban scenes. The authors propose a feed-forward 4D scene synthesis framework that decomposes the scene into three branches modeling close-range static structures, dynamic objects, and distant regions, respectively. By integrating voxel-level 3D Gaussian representations with object-centric dynamic modeling—which the authors present as a first in the field—the method overcomes the temporal inconsistency inherent in conventional per-pixel Gaussian approaches. The framework combines 3D feature-volume-based static geometry prediction, canonical-space dynamic entity modeling, motion-aware rendering, and semantics-enhanced image synthesis. Experiments on KITTI-360, KITTI, Waymo, and PandaSet demonstrate significant improvements over both feed-forward and per-scene optimization baselines, achieving state-of-the-art efficiency, reconstruction accuracy, and temporal coherence for 4D urban scene reconstruction.
📝 Abstract
Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent 3D Gaussian geometry over multiple frames directly from a 3D feature volume, complemented by a semantically enhanced image-based rendering module that predicts their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
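The three-branch decomposition above can be illustrated with a minimal routing sketch. This is purely a toy illustration of the idea of partitioning scene content by distance and dynamics: the depth cutoff, the dynamic mask, and the function names are assumptions for exposition, not values or APIs from the paper.

```python
import numpy as np

# Assumed cutoff between close-range and far-field content (illustrative only;
# the paper does not specify this value).
FAR_FIELD_DEPTH = 50.0

def route_points(depths, is_dynamic):
    """Toy routing of scene points to EvolSplat4D-style branches.

    depths     : (N,) per-point depth from the ego vehicle
    is_dynamic : (N,) boolean mask marking points on moving actors
    returns    : (N,) branch labels: 'static' (volume-based Gaussians),
                 'dynamic' (object-centric canonical space), or
                 'far_field' (per-pixel Gaussians)
    """
    labels = np.full(depths.shape, "static", dtype=object)
    labels[depths >= FAR_FIELD_DEPTH] = "far_field"
    # Dynamic actors take priority over the distance-based split.
    labels[is_dynamic] = "dynamic"
    return labels

depths = np.array([10.0, 60.0, 30.0, 80.0])
dyn = np.array([False, False, True, False])
print(route_points(depths, dyn).tolist())
# → ['static', 'far_field', 'dynamic', 'far_field']
```

The point of the sketch is only that each region type is served by the representation best suited to it, with the per-branch predictions merged at render time.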