🤖 AI Summary
High-fidelity dynamic 4D generation is hindered by temporal artifacts, substantial computational cost, and scarce training data. This work introduces Sculpt4D, which adds a temporally decaying sparse attention mechanism to the pretrained 3D diffusion Transformer Hunyuan3D 2.1. The approach anchors object identity to the initial frame and models spatiotemporal dynamics through block-sparse attention combined with a time-decay mask. This reduces the network's total computation by 56% while preserving identity consistency, and the authors report state-of-the-art temporal coherence and generation quality for 4D content. Building on a pretrained 3D model eases the data bottleneck, and the sparse attention eases the compute bottleneck, pointing toward efficient and scalable 4D generation.
📝 Abstract
Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demands. We present Sculpt4D, a native 4D generative framework that integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design models complex spatiotemporal dependencies with high fidelity while sidestepping the quadratic overhead of full attention, reducing the network's total computation by 56%. Consequently, Sculpt4D establishes a new state of the art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.
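
To make the idea of an anchored, time-decaying block-sparse mask concrete, here is a minimal NumPy sketch. It is not the Sculpt4D implementation: the abstract does not specify the mask construction, so the function name `time_decay_block_mask`, the geometric `decay` schedule, the nearest-block selection rule, and the time-symmetric (non-causal) layout are all illustrative assumptions. The sketch only shows the general pattern: every query block attends densely to the first frame and to its own frame, and to a shrinking number of blocks in frames that are further away in time.

```python
import numpy as np


def time_decay_block_mask(num_frames, blocks_per_frame, decay=0.5):
    """Build a boolean block-level attention mask (hypothetical sketch).

    Returns an array of shape (T*B, T*B); entry [q, k] is True when query
    block q is allowed to attend to key block k.
    """
    T, B = num_frames, blocks_per_frame
    mask = np.zeros((T * B, T * B), dtype=bool)
    block_ids = np.arange(B)

    for qi in range(T):          # query frame
        for kj in range(T):      # key frame
            d = abs(qi - kj)     # temporal distance between the two frames
            if kj == 0 or d == 0:
                # Assumption: the anchor (first) frame and the query's own
                # frame are attended densely.
                keep = B
            else:
                # Assumption: the per-frame block budget decays geometrically.
                keep = int(B * decay ** d)
            for a in range(B):
                # Assumption: keep the `keep` key blocks nearest in index.
                order = np.argsort(np.abs(block_ids - a), kind="stable")
                mask[qi * B + a, kj * B + order[:keep]] = True
    return mask


mask = time_decay_block_mask(num_frames=4, blocks_per_frame=4, decay=0.5)
density = mask.mean()  # fraction of block pairs that are actually computed
```

The fraction of `True` entries gives a rough sense of how much attention work such a mask retains in this toy setting. It is not expected to reproduce the 56% figure reported in the abstract, which refers to the total computation of the full network.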