Hunyuan3D 1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

📅 2024-11-04
📈 Citations: 8 · Influential: 0
🤖 AI Summary
Existing 3D generation methods suffer from slow inference, limited modality support (e.g., text-only or image-only conditioning), and a fundamental trade-off between quality and efficiency. This paper introduces the first unified, dual-conditioned 3D generation framework jointly driven by text and image inputs, operating in two stages: (1) multi-view diffusion generates high-fidelity RGB views in ~4 seconds; (2) a noise-robust, feed-forward 3D reconstruction network recovers textured meshes with high fidelity in ~7 seconds. Key contributions include: (1) the first cross-modal unified conditioning architecture for joint text-image guidance; (2) a novel iterative-optimization-free feed-forward paradigm for multi-view-to-3D reconstruction; and (3) end-to-end generation in only ~11 seconds—substantially faster than state-of-the-art diffusion-based approaches—while achieving new SOTA performance in generation quality, geometric detail, and output diversity.

📝 Abstract
While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this, we propose a two-stage approach named Hunyuan3D 1.0, available in a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by multi-view diffusion, and it leverages the available information from the condition image to efficiently recover the 3D structure. Our framework incorporates the text-to-image model Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has 3x more parameters than our lite version and other existing models. Hunyuan3D 1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.
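
To make the two-stage design concrete, here is a minimal Python sketch of the pipeline the abstract describes. All names (text_to_image, MultiViewDiffusion, FeedForwardReconstructor, generate_3d) are hypothetical stand-ins, not the released Hunyuan3D 1.0 API; the stubs return placeholders so the end-to-end flow (condition image → multi-view diffusion → feed-forward reconstruction) runs as written.

```python
"""Minimal sketch of the two-stage pipeline described in the abstract.

Every name here (text_to_image, MultiViewDiffusion, FeedForwardReconstructor,
generate_3d) is a hypothetical stand-in, NOT the released Hunyuan3D 1.0 API.
The stubs return placeholder arrays so the control flow runs end to end.
"""
import numpy as np


def text_to_image(prompt: str) -> np.ndarray:
    """Stand-in for a text-to-image model such as Hunyuan-DiT,
    used only when the condition is a text prompt."""
    return np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder RGB image


class MultiViewDiffusion:
    """Stage 1 (~4 s): diffusion model that renders the asset as RGB images
    from several fixed viewpoints, conditioned on a single image."""

    def generate(self, cond_image: np.ndarray, num_views: int = 6) -> list[np.ndarray]:
        return [cond_image.copy() for _ in range(num_views)]  # placeholder views


class FeedForwardReconstructor:
    """Stage 2 (~7 s): feed-forward network that recovers a textured mesh
    from the generated views, tolerating the noise and inconsistency the
    diffusion stage introduces; no per-asset iterative optimization."""

    def reconstruct(self, views: list[np.ndarray], condition: np.ndarray) -> dict:
        return {"vertices": np.empty((0, 3)), "faces": np.empty((0, 3), dtype=int)}


def generate_3d(prompt: str | None = None, image: np.ndarray | None = None) -> dict:
    """Unified entry point: accepts a text prompt, a condition image, or both."""
    if image is None:
        if prompt is None:
            raise ValueError("need a text prompt or a condition image")
        image = text_to_image(prompt)  # text conditioning goes through stage 0
    views = MultiViewDiffusion().generate(image)
    return FeedForwardReconstructor().reconstruct(views, condition=image)
```

The key handoff is in the middle: stage 1 expands a single condition into multiple views, turning the ill-posed single-view problem into a better-constrained multi-view reconstruction that a single feed-forward pass can solve.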
Problem

Research questions and friction points this paper is trying to address.

3D model generation
slow generation and inference speed
limited application scope (text-only or image-only conditioning)
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D model generation
text-to-image conversion (Hunyuan-DiT)
multi-view 2D image generation
👥 Authors
Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang (Nanyang Technological University), Hongxu Zhao, Xinhai Liu (Tencent), Xinzhou Wang (Tsinghua University, Tongji University), Qin Lin, Jiaao Yu, Lifu Wang, Jing Xu, Zebin He, Zhuo Chen, Si-Ya Liu, Junta Wu, Yihang Lian, Shaoxiong Yang, Yuhong Liu (Santa Clara University), Yong Yang, Di Wang, Jie Jiang, Chunchao Guo