HunyuanImage 3.0 Technical Report

📅 2025-09-28
🤖 AI Summary
This work addresses the challenge of unifying multimodal understanding and generation within a single autoregressive framework. To this end, we propose a native chain-of-thought architecture and a progressive training strategy, enabling—for the first time in open-source image generation—the successful training and deployment of an over-80-billion-parameter Mixture-of-Experts (MoE) model, with only 13 billion parameters activated per inference step. Leveraging large-scale high-quality multimodal data curation, distributed training system optimizations, and aggressive post-training techniques, we significantly improve text–image alignment and generative visual fidelity. Comprehensive human and automated evaluations demonstrate performance on par with state-of-the-art proprietary models. All code, model weights, and detailed training configurations are publicly released, establishing a foundational infrastructure and introducing a novel paradigm for multimodal foundation model research.

📝 Abstract
We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thought schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. Extensive automatic and human evaluations of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding and generation within a single autoregressive framework
Building infrastructure efficient enough for large-scale training and inference
Closing the gap between open-source image generation models and state-of-the-art proprietary systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies multimodal understanding and generation in one autoregressive model
Employs a Mixture-of-Experts architecture with over 80B total parameters (13B active per token)
Uses progressive pre-training and aggressive post-training
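The sparse-activation property above (over 80B total parameters, only 13B active per token) is the defining trait of MoE models: a router selects a small subset of experts per token, so most parameters sit idle on any given forward pass. The sketch below is a generic top-k MoE router in plain Python for illustration only; the expert count, router form, and `top_k` value here are arbitrary assumptions, not details of HunyuanImage 3.0's actual implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, router_weights, top_k=2):
    """Route one token to its top-k experts and mix their outputs.

    Only top_k experts execute per token, so the active parameter count
    stays a small fraction of the total -- the property that lets a very
    large MoE activate only a fraction of its parameters per token.
    (Illustrative sketch; not the HunyuanImage 3.0 router.)
    """
    # Router: one score per expert (here, a dot product with learned weights).
    scores = [sum(w * x for w, x in zip(wr, token)) for wr in router_weights]
    probs = softmax(scores)
    # Keep only the top_k experts; renormalize their gate values to sum to 1.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        gate = probs[i] / norm
        y = experts[i](token)  # only the selected experts run
        out = [o + gate * v for o, v in zip(out, y)]
    return out
```

In a real model the experts are large feed-forward networks and the router is trained jointly with them (usually with a load-balancing loss); here each expert is just a stand-in function to keep the routing logic visible.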
👥 Authors

Siyu Cao – Tencent Hunyuan Foundation Model Team
Hangting Chen – Tencent Hunyuan (signal processing, speech separation, DCASE)
Peng Chen – Tencent Hunyuan Foundation Model Team
Yiji Cheng – Tsinghua University (Computer Vision, Generative Models)
Yutao Cui – Tencent Hunyuan (Generative Models, Multi-Modal, Object Tracking)
Xinchi Deng – Tencent Hunyuan Foundation Model Team
Ying Dong – Tencent Hunyuan Foundation Model Team
Kipper Gong – Tencent Hunyuan Foundation Model Team
Tianpeng Gu – Tencent Hunyuan Foundation Model Team
Xiusen Gu – Tencent Hunyuan Foundation Model Team
Tiankai Hang – Southeast University & Microsoft Research Asia (Computer Vision, Generative Models, Deep Learning, Artificial Intelligence)
Duojun Huang – Sun Yat-sen University (Computer Vision)
Jie Jiang – Tencent Hunyuan Foundation Model Team
Zhengkai Jiang – Tencent Hunyuan (RLHF, Diffusion Models)
Weijie Kong – Tencent Hunyuan Foundation Model Team
Changlin Li – Tencent (Deep Learning, Computer Vision)
Donghao Li – Tencent Hunyuan Foundation Model Team
Junzhe Li – Tencent Hunyuan Foundation Model Team
Xin Li – Tencent Hunyuan Foundation Model Team
Yang Li – Tencent Hunyuan Foundation Model Team
Zhenxi Li – Tencent Hunyuan Foundation Model Team
Zhimin Li – Vanderbilt University (Visualization, HPC, Machine Learning)
Jiaxin Lin – The University of Texas at Austin (Computer Science)
Linus – Tencent Hunyuan Foundation Model Team
Lucaz Liu – Tencent Hunyuan Foundation Model Team