HunyuanVideo: A Systematic Framework For Large Video Generative Models

📅 2024-12-03
📈 Citations: 1
Influential: 0
🤖 AI Summary
Open-source video generation models significantly lag behind proprietary counterparts, widening the quality gap between industry and the public. Method: We introduce the first ultra-large-scale open-source video foundation model (13B+ parameters), covering the full stack, from dataset curation and architecture design to progressive training and efficient inference. Key innovations include a spatiotemporally decoupled diffusion architecture, multi-stage data cleaning and synthetic data augmentation, progressive scaling during training, and a lightweight inference engine. Contribution/Results: The model achieves state-of-the-art performance among open-source models in visual fidelity, motion coherence, text-video alignment, and camera motion modeling, surpassing Runway Gen-3, Luma 1.6, and three top-performing Chinese models. Fully open-sourced code and weights foster fair, reproducible, and sustainable community advancement in video generation research.

📝 Abstract
Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.
Problem

Research questions and friction points this paper is trying to address.

Video Quality Enhancement
Code Accessibility
Advanced Videography Techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source Video Generation
Large-scale Model Training
Realistic Video Production
Authors
Weijie Kong
Qi Tian
Zijian Zhang
Rox Min
Zuozhuo Dai
Jin Zhou
Jiangfeng Xiong (Tencent · AIGC)
Xin Li
Bo Wu
Jianwei Zhang
Kathrina Wu
Qin Lin
Aladdin Wang
Andong Wang
Jiawang Bai (Tsinghua University · AIGC)
Changlin Li (Tencent · Deep Learning, Computer Vision)
Duojun Huang (Sun Yat-sen University · Computer Vision)
Fang Yang
Hao Tan (Adobe Research · Vision and Language, 3D Multimodal)
Hongmei Wang
Jacob Song
Jianbing Wu
Jinbao Xue
Joey Wang
Junkun Yuan (Research Scientist, Tencent · Computer Vision, Multimodal AI, Generative AI)
Kai Wang
Mengyang Liu (City University of Hong Kong · Deep Learning, Computer Vision, AIGC)
Pengyu Li
Shuai Li
Weiyan Wang (Tencent · Machine Learning Systems, High Performance Computing)
Wenqing Yu
Xinchi Deng
Yanxin Long (Tencent; Sun Yat-sen University · Computer Vision, Vision+Language)
Yi Chen
Yutao Cui (Tencent Hunyuan · Generative Models, Multi-Modal, Object Tracking)
Yuanbo Peng
Zhentao Yu (Researcher, Tencent Hunyuan · Computer Vision)
Zhiyu He (Tsinghua University · Recommendation)
Zhiyong Xu
Zixiang Zhou
Zunnan Xu (Tsinghua University · Computer Vision, Machine Learning)
Yangyu Tao
Qinglin Lu
Songtao Liu
Daquan Zhou (Bytedance, US · Artificial Intelligence, Deep Learning)
Hongfa Wang
Yong Yang
Di Wang
Yuhong Liu (Santa Clara University · Trustworthy AI, Security and Privacy, IoT, Blockchain, Social Networks)
Jie Jiang
Caesar Zhong