🤖 AI Summary
3D generative models face fundamental trade-offs between geometric fidelity, computational efficiency, and scalability due to the difficulty of modeling complex spatial structures and the high cost of volumetric representation. To address this, we propose VoxSet—a semi-structured voxel representation that explicitly encodes spatial topology under high compression, enabling position-aware generation and token-level test-time scaling. Our two-stage framework first generates sparse voxel anchors, then refines geometry to arbitrary resolution via a correction-flow Transformer. This unifies structured 3D modeling with efficient decoding, drastically reducing training overhead. Evaluated across multiple benchmarks, VoxSet achieves state-of-the-art performance in fidelity, diversity, and scalability metrics. It enables high-fidelity, large-scale, and flexible inference for 3D asset generation—supporting resolution-agnostic output and interactive editing without retraining.
📝 Abstract
We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured and scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit structure into the latent space, allowing positional embeddings to guide generation and enabling strong token-level test-time scaling. Built upon this representation, LATTICE adopts a two-stage pipeline: first generating a sparse voxelized geometry anchor, then producing detailed geometry using a rectified flow transformer. Our method is simple at its core, but supports arbitrary resolution decoding, low-cost training, and flexible inference schemes, achieving state-of-the-art performance on various aspects, and offering a significant step toward scalable, high-quality 3D asset creation.