🤖 AI Summary
Existing 3D Gaussian Splatting reconstructions accurately model static scenes but lack dynamic expressiveness; conversely, video diffusion models generate photorealistic motion yet struggle with multi-view consistency. This paper introduces the first text-driven framework for animating pre-reconstructed static 3D Gaussian Splatting scenes. Our method bridges 2D video generation capabilities with 3D spatial reasoning by jointly leveraging video diffusion priors and a dedicated 3D motion enhancement module. It injects physically plausible, multi-view-consistent dynamics into arbitrary pre-built Gaussian Splatting scenes while preserving their original geometric structure. Unlike prior works limited to single-object or character animation, our approach generalizes across diverse object categories and complex real-world scenes. The result is high-fidelity, controllable text-to-dynamic-3D generation with strict multi-view consistency, enabling novel applications in immersive content creation and interactive 3D storytelling.
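To make the "add motion, preserve geometry" idea concrete, here is a minimal sketch of animating a static Gaussian Splatting scene by offsetting only the Gaussian means over time, leaving covariances and opacities untouched. The displacement field and all names below are hypothetical stand-ins, not the paper's actual motion module, which is distilled from video diffusion guidance rather than hand-crafted.

```python
# Minimal sketch (not the authors' code): animate a static Gaussian
# Splatting scene by offsetting Gaussian means over time. Covariances
# and opacities stay fixed, so the original geometry is preserved.
import numpy as np

rng = np.random.default_rng(0)
num_gaussians = 10_000
means = rng.normal(size=(num_gaussians, 3))               # static Gaussian centers
covariances = np.tile(np.eye(3) * 0.01, (num_gaussians, 1, 1))  # unchanged
opacities = rng.uniform(0.5, 1.0, size=num_gaussians)            # unchanged

def displacement(points: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical stand-in for a learned 3D motion field: a gentle,
    spatially varying vertical sway. The real field would come from
    video diffusion guidance, not a hand-written formula."""
    out = np.zeros_like(points)
    out[:, 1] = 0.05 * np.sin(2.0 * np.pi * t + points[:, 0])
    return out

# Animate: only the means move frame to frame.
frames = [means + displacement(means, t) for t in np.linspace(0.0, 1.0, 24)]
```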
📝 Abstract
State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack "liveliness," a key component for creating engaging 3D experiences. Recently, video diffusion models have emerged that generate realistic videos with complex motion and enable animation of 2D images; however, they cannot naively be used to animate 3D scenes, as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and of a large variety of object classes, whereas related work mostly focuses on prior-based character animation or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.
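The central lifting step, turning 2D motion observed in generated video frames into 3D scene motion, can be pictured as unprojecting 2D optical flow with per-pixel depth. The sketch below is an assumption-laden illustration using a pinhole camera; in practice the depth would be rendered from the Gaussian scene and the flow estimated from the generated video, and none of the names reflect the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's method): lift 2D optical
# flow into 3D displacements by unprojecting matched pixels with depth.
import numpy as np

H, W = 64, 64
fx = fy = 60.0                        # pinhole focal lengths (pixels)
cx, cy = W / 2.0, H / 2.0             # principal point

def unproject(u, v, z):
    """Back-project pixel coordinates (u, v) with depth z to camera space."""
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1)

# Hypothetical inputs: depth maps rendered from the Gaussian scene at two
# time steps, and 2D optical flow estimated between generated video frames.
depth_t0 = np.full((H, W), 2.0)
depth_t1 = np.full((H, W), 2.0)
flow = np.zeros((H, W, 2))
flow[..., 0] = 0.5                    # every pixel shifts 0.5 px to the right

u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
p0 = unproject(u, v, depth_t0)                     # 3D points at frame t
u1, v1 = u + flow[..., 0], v + flow[..., 1]        # flow-advected pixels

# Sample depth at the advected pixels (nearest neighbor for brevity).
ui = np.clip(np.round(u1).astype(int), 0, W - 1)
vi = np.clip(np.round(v1).astype(int), 0, H - 1)
p1 = unproject(u1, v1, depth_t1[vi, ui])           # 3D points at frame t+1

motion_3d = p1 - p0    # per-pixel 3D displacement to drive nearby Gaussians
```

A real pipeline would additionally handle occlusions, aggregate per-pixel displacements onto Gaussians across multiple views, and regularize the result for multi-view consistency; the sketch only shows the geometric core of the lifting idea.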