Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

📅 2026-04-23

🤖 AI Summary

High-fidelity dynamic 4D generation is hindered by temporal artifacts, high computational cost, and scarce training data. This work addresses these challenges by adding a temporally decaying sparse attention mechanism to the pretrained 3D diffusion Transformer Hunyuan3D 2.1. The approach anchors object identity to the initial frame and models spatiotemporal dynamics through block-sparse attention with a time-decaying mask. This reduces total network computation by 56% while preserving identity consistency, and the authors report state-of-the-art temporal coherence in 4D synthesis. Building on a pretrained 3D model eases the shortage of 4D training data, and the sparse attention eases the compute bottleneck, pointing toward efficient and scalable 4D generation.
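The summary does not give the exact masking rule, so the following is only a minimal sketch of what a frame-level "anchor plus time-decaying" block mask could look like. The rule used here (every frame sees frame 0, its immediate neighbours within `window`, and frames at power-of-two distances, so attended frames thin out as temporal distance grows) is a hypothetical stand-in, as are the names `frame_block_mask`, `token_mask`, `window`, and `tokens_per_frame`; it is not the paper's implementation.

```python
import numpy as np


def frame_block_mask(num_frames: int, window: int = 1) -> np.ndarray:
    """Frame-level mask: entry [i, j] is True if query frame i may attend to key frame j.

    Hypothetical rule (an assumption, not taken from the paper):
      * every frame attends to the anchor frame 0 (identity anchoring),
      * every frame attends to frames within `window` steps (local motion),
      * every frame attends to frames at power-of-two distances, so the
        attended frames become sparser as temporal distance grows.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        for j in range(num_frames):
            d = abs(i - j)
            is_anchor = j == 0
            is_local = d <= window
            is_dilated = d > 0 and (d & (d - 1)) == 0  # d is a power of two
            mask[i, j] = is_anchor or is_local or is_dilated
    return mask


def token_mask(frame_mask: np.ndarray, tokens_per_frame: int) -> np.ndarray:
    """Expand the frame-level mask into a token-level block mask.

    Each frame is assumed to contribute `tokens_per_frame` consecutive tokens,
    so every True frame entry becomes a dense tokens_per_frame x tokens_per_frame block.
    """
    block = np.ones((tokens_per_frame, tokens_per_frame), dtype=bool)
    return np.kron(frame_mask, block).astype(bool)


if __name__ == "__main__":
    m = frame_block_mask(16)
    print(m.astype(int))
    print("fraction of frame blocks kept:", m.mean())
```

Because the number of kept blocks per row grows roughly logarithmically under this rule rather than linearly, the kept fraction falls as the sequence gets longer; the paper's actual decay schedule may differ.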

📝 Abstract

Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies while sidestepping the quadratic overhead of full attention, reducing total network computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.
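To make the contrast with quadratic full attention concrete, here is a minimal NumPy sketch, assuming single-head attention over tokens grouped into contiguous per-frame blocks. The function names, the flattened token layout, and the example mask (local window plus anchor frame 0) are hypothetical illustrations, not the paper's kernel. The sparse version only gathers the key/value blocks a query frame is allowed to see, so its work scales with the number of kept frame blocks rather than with `num_frames ** 2`, yet it produces the same output as dense attention with masked-out scores.

```python
import numpy as np


def dense_masked_attention(q, k, v, mask):
    """Reference: full (quadratic) attention with disallowed scores set to -inf."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v


def block_sparse_attention(q, k, v, frame_mask, tokens_per_frame):
    """Attention computed frame by frame, touching only allowed key/value blocks.

    Assumes each query frame may see at least one key frame (true whenever
    every frame attends to an anchor frame).
    """
    d = q.shape[-1]
    out = np.empty_like(q)
    for i in range(frame_mask.shape[0]):
        rows = slice(i * tokens_per_frame, (i + 1) * tokens_per_frame)
        kept = np.flatnonzero(frame_mask[i])  # key frames visible to frame i
        cols = np.concatenate(
            [np.arange(j * tokens_per_frame, (j + 1) * tokens_per_frame) for j in kept]
        )
        scores = q[rows] @ k[cols].T / np.sqrt(d)  # only kept blocks are computed
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[rows] = w @ v[cols]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_frames, tokens_per_frame, dim = 6, 4, 8
    idx = np.arange(num_frames)
    # Hypothetical frame mask: local window of 1 plus the anchor frame 0.
    fm = (np.abs(idx[:, None] - idx[None, :]) <= 1) | (idx[None, :] == 0)
    q, k, v = (rng.standard_normal((num_frames * tokens_per_frame, dim)) for _ in range(3))
    tok_mask = np.kron(fm, np.ones((tokens_per_frame, tokens_per_frame), dtype=bool))
    sparse = block_sparse_attention(q, k, v, fm, tokens_per_frame)
    dense = dense_masked_attention(q, k, v, tok_mask)
    print("matches dense:", np.allclose(sparse, dense))
    print("fraction of frame blocks computed:", fm.sum() / fm.size)
```

The Python loop only shows which blocks are computed; a practical implementation would fuse this into a GPU kernel. The 56% figure in the abstract refers to total network computation, which this attention-only sketch does not attempt to reproduce.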

Problem

Research questions and friction points this paper is trying to address.

4D generation

temporal artifacts

computational demand

spatiotemporal dependencies

dynamic shape synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

4D generation

Sparse Attention

Diffusion Transformer