Structured 3D Latents for Scalable and Versatile 3D Generation

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 56
✨ Influential: 18
🤖 AI Summary
To address the demand for high-fidelity, diverse 3D asset generation with flexible editing, this paper introduces SLAT, a structured 3D latent representation that jointly encodes a sparse 3D grid structure and dense visual features from a multi-view vision foundation model, enabling unified decoding into multiple 3D formats (e.g., radiance fields, 3D Gaussians, explicit meshes). Methodologically, the paper trains rectified-flow transformers with up to 2B parameters for large-scale modeling in the SLAT latent space, a first at this scale. The authors curate a high-quality dataset of 500K 3D assets and train end to end. SLAT supports text- and image-conditioned generation, achieving state-of-the-art fidelity, diversity, and editability, and enables local 3D editing and on-demand switching of the output format. Code, models, and data are publicly released.
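The structured latent summarized above can be pictured as a sparse set of active voxel coordinates, each carrying a feature vector distilled from multi-view image features. A minimal sketch of such a container, purely illustrative (names like `SparseLatent` are our own, not the paper's released code):

```python
import numpy as np

class SparseLatent:
    """A structured 3D latent: active voxel coordinates plus per-voxel features.

    Only occupied cells of an N^3 grid are stored, so memory scales with the
    object's surface rather than the full volume.
    """

    def __init__(self, coords: np.ndarray, feats: np.ndarray, grid_size: int):
        assert coords.shape[0] == feats.shape[0]  # one feature per active voxel
        assert coords.shape[1] == 3               # (x, y, z) integer indices
        self.coords = coords        # (L, 3) int voxel indices in [0, grid_size)
        self.feats = feats          # (L, C) latent features
        self.grid_size = grid_size

    def occupancy(self) -> np.ndarray:
        """Densify the structure channel into an N^3 boolean grid."""
        grid = np.zeros((self.grid_size,) * 3, dtype=bool)
        x, y, z = self.coords.T
        grid[x, y, z] = True
        return grid

# toy example: 4 active voxels in a 16^3 grid with 8-dim features
coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [15, 15, 15]])
feats = np.random.randn(4, 8)
slat = SparseLatent(coords, feats, grid_size=16)
occ = slat.occupancy()
```

Different decoders (radiance field, 3D Gaussians, mesh) can then consume the same `(coords, feats)` pair, which is what makes the output format switchable.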

๐Ÿ“ Abstract
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
Problem

Research questions and friction points this paper is trying to address.

Develops a scalable 3D generation method for diverse assets
Unifies representation for multiple output formats like meshes and radiance fields
Enables high-quality text/image-conditioned 3D generation and local editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Structured LATent representation for 3D generation
Sparse 3D grid with dense multiview visual features
Rectified flow transformers tailored for SLAT
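Rectified flow trains a network to predict the constant velocity along a straight path between noise and data. A minimal sketch of that training loss, with a toy MLP standing in for the paper's transformer (our illustration; the convention and network are assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

# toy velocity-prediction network standing in for the rectified-flow transformer
model = nn.Sequential(nn.Linear(9, 64), nn.SiLU(), nn.Linear(64, 8))

def rectified_flow_loss(model: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    """One rectified-flow training step on clean latents x0 of shape (B, 8).

    Sample t ~ U(0, 1), interpolate x_t = (1 - t) * x0 + t * noise along the
    straight path, and regress the path's constant velocity (noise - x0).
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1)
    x_t = (1.0 - t) * x0 + t * noise
    target = noise - x0                          # velocity of the straight path
    v_pred = model(torch.cat([x_t, t], dim=-1))  # condition on t by concatenation
    return ((v_pred - target) ** 2).mean()

loss = rectified_flow_loss(model, torch.randn(16, 8))
loss.backward()
```

Because the path is straight, sampling can integrate the learned velocity field from noise to data in relatively few steps.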
Authors

Jianfeng Xiang (Tsinghua University)
Zelong Lv (Microsoft Research)
Sicheng Xu (Microsoft Research Asia)
Yu Deng (Microsoft Research)
Ruicheng Wang (Student)
Bowen Zhang (USTC, Microsoft Research)
Dong Chen (Microsoft Research)
Xin Tong (Microsoft Research)
Jiaolong Yang (Microsoft Research)