AI Summary
Existing text-driven 3D animation methods struggle to simultaneously achieve geometric/appearance fidelity and motion naturalness. This paper proposes the first text-to-4D animation framework tailored to user-provided static 3D objects: it represents the input object as a static 4D NeRF and synthesizes dynamics by coupling it with a text-guided video diffusion model. Two key innovations are introduced: (1) an incremental view-sampling protocol, and (2) an attention-mask-constrained Score Distillation Sampling (SDS) loss that explicitly enforces temporal consistency and identity preservation. Experiments demonstrate that the method consistently outperforms baselines in prompt adherence, temporal coherence, and visual fidelity. Notably, it improves LPIPS-based identity preservation by up to 3×, marking the first approach to jointly optimize high-fidelity static geometry and high-quality dynamic content in text-driven 4D generation.
Abstract
Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement, and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity, and find that our method outperforms baselines built on alternative approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores while effectively balancing visual quality with dynamic content.
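The masked SDS loss mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `masked_sds_grad`, the mask convention, and the scalar timestep weight `w_t` are all hypothetical, and in practice the mask would come from the video diffusion model's cross-attention maps rather than being hand-built.

```python
import numpy as np

def masked_sds_grad(noise_pred, noise, attn_mask, w_t):
    """Attention-masked SDS gradient (illustrative sketch, not the paper's code).

    noise_pred : (B, C, H, W) noise predicted by the video diffusion model
    noise      : (B, C, H, W) noise that was actually added to the rendering
    attn_mask  : (B, 1, H, W) attention map in [0, 1]; high where the text
                 prompt attends (regions that should animate), low elsewhere
    w_t        : scalar timestep weight

    Plain SDS pushes the rendering toward the diffusion prior everywhere;
    multiplying by the attention mask confines the update to prompt-relevant
    regions, so the object's identity is preserved outside the mask.
    """
    return w_t * attn_mask * (noise_pred - noise)

# Toy usage: a mask that is zero on the left half blocks gradients there.
rng = np.random.default_rng(0)
B, C, H, W = 1, 3, 4, 4
pred = rng.standard_normal((B, C, H, W))
eps = rng.standard_normal((B, C, H, W))
mask = np.ones((B, 1, H, W))
mask[..., : W // 2] = 0.0  # freeze the left half of the image
grad = masked_sds_grad(pred, eps, mask, w_t=1.0)
```

The masking happens per pixel before the gradient is backpropagated into the NeRF rendering, which is what lets the method keep static regions of the object untouched while the prompt-relevant parts move.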