🤖 AI Summary
To address the prohibitive computational and memory costs of high-resolution voxelized 3D generation, this paper introduces the first scalable generative framework based on sparse volumetric representations, enabling high-fidelity synthesis of 3D shapes at gigavoxel resolution (1024³). The method comprises three key innovations: (1) Spatial Sparse Attention (SSA), the first attention mechanism designed for Diffusion Transformers that operates efficiently over dynamically sparse voxel grids; (2) an end-to-end unified sparse volumetric variational autoencoder (VAE), which improves training stability and the efficiency of the latent representation; and (3) distributed GPU optimizations that accelerate the forward and backward passes by 3.9× and 9.6×, respectively. Remarkably, the framework trains 1024³ models on only eight GPUs, in sharp contrast to prior state-of-the-art methods that require at least 32 GPUs at 256³ resolution. Both generation quality and efficiency surpass existing approaches, establishing new state-of-the-art performance.
📝 Abstract
Generating high-resolution 3D shapes with volumetric representations such as Signed Distance Functions poses substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly improves the efficiency of Diffusion Transformer computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9× speedup in the forward pass and a 9.6× speedup in the backward pass. Our framework also includes a variational autoencoder that maintains a consistent sparse volumetric format across the input, latent, and output stages. Compared to previous 3D VAEs that mix heterogeneous representations, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency but also enables training at 1024³ resolution with only 8 GPUs, a task that typically requires at least 32 GPUs for volumetric representations at 256³ resolution, making gigascale 3D generation both practical and accessible. Project page: https://nju3dv.github.io/projects/Direct3D-S2/.
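The core idea behind attention over sparse volumes can be illustrated with a toy sketch: only occupied voxels carry tokens, tokens are grouped by the spatial block that contains their coordinate, and full attention runs only within each block. The snippet below is a minimal NumPy illustration under assumptions of our own (fixed cubic blocks, a single head, no learned projections, no cross-block interaction), not the paper's SSA implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_sparse_attention(coords, q, k, v, block_size=4):
    """Toy block-local attention over sparse voxel tokens.

    coords: (N, 3) integer coordinates of occupied voxels
    q, k, v: (N, d) per-token query/key/value vectors
    Tokens are bucketed by the spatial block containing their
    coordinate; dense attention runs only inside each bucket,
    so cost scales with occupied voxels, not the full grid.
    """
    out = np.zeros_like(v)
    block_ids = coords // block_size          # (N, 3) block index per token
    groups = {}
    for i, b in enumerate(block_ids):
        groups.setdefault(tuple(b), []).append(i)
    d = q.shape[1]
    for idx in groups.values():
        idx = np.array(idx)
        scores = q[idx] @ k[idx].T / np.sqrt(d)   # (m, m) local scores
        out[idx] = softmax(scores, axis=-1) @ v[idx]
    return out
```

A token that shares its block with no other token simply attends to itself, so its output equals its value vector; the real SSA additionally handles dynamically varying sparsity patterns and is fused for GPU efficiency, which this sketch does not attempt.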