🤖 AI Summary
To address the critical need for novel-view synthesis in dynamic scenes for AR/VR and the metaverse, this paper formulates spatiotemporal 4D modeling as a native learning problem. We propose the first end-to-end 4D Gaussian ellipsoid representation framework: employing rotatable, anisotropic 4D Gaussians as explicit geometric and appearance primitives, integrating 4D spherical harmonics for time-varying, view-dependent appearance modeling, and introducing 4D Gaussian splatting with differentiable volumetric rendering. Our method requires no explicit motion priors and enables real-time, high-resolution, photorealistic rendering. It significantly outperforms state-of-the-art methods across diverse scenarios—including single-object, indoor, and driving scenes—achieving superior trade-offs between visual fidelity and computational efficiency. Specifically, it reduces memory consumption by 40% and accelerates training by 2.3×. Furthermore, the framework naturally extends to 4D generative modeling and scene understanding tasks.
📝 Abstract
Dynamic 3D scene representation and novel view synthesis from captured videos are crucial for enabling immersive experiences required by AR/VR and metaverse applications. However, this task is challenging due to the complexity of unconstrained real-world scenes and their temporal dynamics. In this paper, we frame dynamic scenes as a spatio-temporal 4D volume learning problem, offering a native explicit reformulation with minimal assumptions about motion, which serves as a versatile dynamic scene learning framework. Specifically, we represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features, dubbed as 4D Gaussian splatting (4DGS). This approach can capture relevant information in space and time by fitting the underlying spatio-temporal volume. Modeling the spacetime as a whole with 4D Gaussians parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, our model can naturally learn view-dependent and time-evolved appearance with 4D spherindrical harmonics. Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, photorealistic novel views for complex dynamic scenes. To enhance efficiency, we derive several compact variants that effectively reduce memory footprint and mitigate the risk of overfitting. Extensive experiments validate the superiority of 4DGS in terms of visual quality and efficiency across a range of dynamic scene-related tasks (e.g., novel view synthesis, 4D generation, scene understanding) and scenarios (e.g., single object, indoor scenes, driving environments, synthetic and real data).