🤖 AI Summary
This work addresses the efficient generation of semantically coherent 3D voxel environments that support interactive user editing. The authors construct Dream-Cubed, a large-scale dataset of Minecraft worlds at voxel resolution comprising tens of billions of tokens, drawn from a curated mixture of procedural biome terrain and human-authored maps. They use it to conduct the first large-scale study of 3D diffusion models for voxel generation, comparing discrete and continuous diffusion formulations, data compositions, and architectural design choices, including a voxel-level Transformer architecture. Because the models operate directly in the space of blocks, they support interactive editing such as inpainting and outpainting from user-authored blocks. For evaluation, the authors adapt the Fréchet Inception Distance (FID) to measure semantic differences between renderings of real and generated worlds, and complement it with a human preference study. Experiments indicate that the models generate high-quality, semantically consistent environments and outperform baseline methods in human assessments. The dataset, code, and pretrained models are publicly released.
📝 Abstract
We introduce Dream-Cubed, a large-scale dataset of Minecraft worlds at voxel resolution, and a family of models using cubes as powerful compositional units for efficient generation of interactive 3D environments. Dream-Cubed comprises tens of billions of tokens from a carefully curated mixture of procedural biome terrain and high-quality human-authored maps. We use this dataset to conduct the first large-scale study of 3D diffusion models for voxel generation, analyzing discrete and continuous diffusion formulations, data compositions, and architectural design choices. Our models operate directly in the space of blocks, enabling efficient and semantically grounded generation while supporting interactive user workflows such as inpainting and outpainting from user-authored blocks. To quantitatively evaluate our models, we adapt the FID metric to assess semantic differences between real and generated world renderings, and validate generation quality through a human preference study. We release the full dataset, code, and all our pretrained models, which we hope will provide a foundation for future research in efficient generative modeling for structured, interactive 3D environments.
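For readers unfamiliar with the FID family of metrics mentioned above, the following is a minimal sketch of the underlying Fréchet distance between two sets of feature vectors, each summarized by a Gaussian (mean and covariance). It is not the authors' evaluation code: the function name `frechet_distance` is hypothetical, and the features are assumed to have already been extracted from renderings of real and generated worlds by some image encoder, a step the abstract does not specify.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (num_samples, feature_dim). How the features are
    obtained (for example, from renderings of voxel worlds) is an assumption
    left outside this sketch.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs. Clipping guards against tiny negative
    # values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 8))
    fake = rng.normal(loc=0.5, size=(500, 8))
    print("Frechet distance:", frechet_distance(real, fake))
```

A lower value means the two feature distributions are closer; identical feature sets give a distance of zero up to floating-point error.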