Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

📅 2025-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-video (T2V) diffusion models suffer from limitations in long-sequence modeling, controllability, and text-video alignment. To address these, we propose Step-Video-T2V—a bilingual (Chinese-English) 30B-parameter foundation model capable of generating high-fidelity videos up to 204 frames. Our method introduces three key innovations: (1) the first video-specific deep-compression VAE (Video-VAE), achieving 16×16 spatial and 8× temporal compression; (2) a 3D full-attention DiT architecture trained via Flow Matching; and (3) Video-DPO, a video-level direct preference optimization technique that enhances semantic consistency and controllability. We further release Step-Video-T2V-Eval—the first open-source T2V benchmark—on which Step-Video-T2V achieves state-of-the-art performance, significantly outperforming leading open-source and commercial models. All code, models, and evaluation datasets are publicly released.
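The Flow Matching objective used to train the DiT can be illustrated with a minimal sketch (plain Python on a toy flattened latent; the actual Step-Video-T2V training code is not part of this summary): sample an interpolation time t, form x_t = (1 - t)·x0 + t·x1 between noise x0 and data x1, and regress the model's velocity prediction onto the constant target x1 - x0.

```python
def flow_matching_loss(predict_velocity, x0, x1, t):
    """Conditional flow-matching loss for one flattened latent vector.

    predict_velocity: callable (x_t, t) -> list of velocity estimates
    x0: noise sample, x1: data (latent) sample, t in [0, 1].
    """
    # Linear interpolation path between noise and data.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    # The velocity of the linear path is constant: x1 - x0.
    target = [b - a for a, b in zip(x0, x1)]
    pred = predict_velocity(x_t, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)

# Toy latents (real latents are large 3D tensors produced by the Video-VAE).
x0 = [0.5, -1.2, 0.3, 2.0]
x1 = [1.0, 0.4, -0.7, 0.1]

# An oracle that already outputs the true velocity achieves zero loss.
oracle = lambda x_t, t: [b - a for a, b in zip(x0, x1)]
```

In practice `predict_velocity` is the 30B-parameter DiT conditioned on the text embedding; the sketch only shows the shape of the regression target.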

📝 Abstract
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can also be accessed at https://yuewen.cn/videos. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
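The stated compression ratios translate directly into latent-space savings: 16x16 spatial and 8x temporal compression shrinks a clip by a factor of 16 * 16 * 8 = 2048 in per-channel spatiotemporal elements. A small sketch makes the arithmetic concrete (the resolution and the use of plain floor division are illustrative assumptions; video VAEs often treat the first frame specially, so exact latent frame counts may differ).

```python
def latent_shape(frames, height, width, ts=8, ss=16):
    """Latent grid size under 8x temporal and 16x16 spatial compression.

    ts: temporal stride, ss: spatial stride. Floor division is a
    simplification of how a real video VAE handles boundary frames.
    """
    return frames // ts, height // ss, width // ss

# A 204-frame clip at a hypothetical 720x1280 resolution:
t, h, w = latent_shape(204, 720, 1280)

# Per-channel element reduction achieved by the compression ratios alone.
reduction = (204 * 720 * 1280) / (t * h * w)
```

This deep compression is what makes 3D full attention over a 204-frame sequence tractable: the attention cost is paid on the much smaller latent grid, not on raw pixels.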
Problem

Research questions and friction points this paper is trying to address.

Develop an advanced text-to-video generation model
Enhance video quality with deep compression techniques
Address limitations in current video foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep compression Variational Autoencoder
Bilingual text encoders
Video-DPO for visual quality
Guoqing Ma
StepFun
Haoyang Huang
JD Explore Academy (present) | StepFun | Microsoft Research
Multimodal & Multilingual Foundation Model
Kun Yan
Beihang University & Microsoft Research
Natural Language Processing, Computer Vision
Liangyu Chen
StepFun
Nan Duan
JD.Com (now) | StepFun | Microsoft Research
NLP, Artificial General Intelligence
Shengming Yin
University of Science and Technology of China
Computer Vision
Changyi Wan
StepFun
Ranchen Ming
StepFun
Xiaoniu Song
StepFun
Xing Chen
StepFun
Yu Zhou
StepFun
Deshan Sun
StepFun
Deyu Zhou
Professor, School of Computer Science and Engineering, SEU
Natural Language Processing
Jian Zhou
StepFun
Kaijun Tan
StepFun
Kang An
StepFun
Mei Chen
StepFun
Wei Ji
StepFun
Qiling Wu
StepFun
Wen Sun
StepFun
Xin Han
StepFun
Yanan Wei
StepFun
Zheng Ge
Senior Researcher, StepFun
Multimodal Models, Perception and Reasoning
Aojie Li
StepFun
Bin Wang
StepFun
Bizhu Huang
StepFun
Bo Wang
StepFun
Brian Li
StepFun
Changxing Miao
StepFun
Chen Xu
StepFun
Chenfei Wu
StepFun
Chenguang Yu
StepFun
Dapeng Shi
StepFun
Dingyuan Hu
StepFun
Enle Liu
StepFun
Gang Yu
StepFun
Ge Yang
StepFun
Guanzhe Huang
StepFun
Gulin Yan
StepFun
Haiyang Feng
StepFun
Hao Nie
StepFun
Haonan Jia
StepFun
Hanpeng Hu
The University of Hong Kong
Distributed ML, ML Diagnosis and Optimization
Hanqi Chen
StepFun
Haolong Yan
StepFun
Heng Wang
StepFun
Hongcheng Guo
School of Data Science, Fudan University
LLMs, Multimodal LLMs
Huilin Xiong
StepFun
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast dynamics, atomic molecular physics, free electron laser
Jiahao Gong
StepFun
Jianchang Wu
StepFun
Jiaoren Wu
StepFun
Jie Wu
StepFun
Jie Yang
StepFun
Jiashuai Liu
StepFun
Jiashuo Li
StepFun
Jingyang Zhang
StepFun
Junjing Guo
StepFun
Junzhe Lin
StepFun
Kaixiang Li
StepFun
Lei Liu
StepFun
Lei Xia
StepFun
Liang Zhao
StepFun
Liguo Tan
StepFun
Liwen Huang
StepFun
Liying Shi
StepFun
Ming Li
StepFun
Mingliang Li
Tsinghua University
Computer Systems
Muhua Cheng
StepFun
Na Wang
StepFun
Qiaohui Chen
StepFun
Qinglin He
StepFun
Qiuyan Liang
StepFun
Quan Sun
StepFun
Ran Sun
StepFun
Rui Wang
StepFun
Shaoliang Pang
StepFun
Shiliang Yang
StepFun
Sitong Liu
Duke University
Siqi Liu
StepFun
Shuli Gao
StepFun
Tiancheng Cao
Schmidt AI in Science Fellow, CSIE, Nanyang Technological University
Neuromorphic computing, Edge Intelligence, Internet of Medical Things (IoMT), Translational medicine
Tianyu Wang
StepFun
Weipeng Ming
StepFun
Wenqing He
StepFun
Xu Zhao
StepFun
Xuelin Zhang
StepFun
Xianfang Zeng
StepFun
Xiaojia Liu
StepFun
Xuan Yang
StepFun
Yaqi Dai
StepFun
Yanbo Yu
StepFun
Yang Li
StepFun
Yineng Deng
StepFun
Yingming Wang
StepFun
Yilei Wang
Alibaba Cloud
Y
Yuanwei Lu
StepFun
Y
Yu Chen
StepFun
Y
Yu Luo
StepFun
Y
Yuchu Luo
StepFun
Y
Yuhe Yin
StepFun
Y
Yuheng Feng
StepFun
Y
Yuxiang Yang
StepFun
Z
Zecheng Tang
StepFun
Z
Zekai Zhang
StepFun
Z
Zidong Yang
StepFun
B
Binxing Jiao
StepFun
Jiansheng Chen
Jiansheng Chen
School of Computer and Communication Engineering, University of Science and Technology Beijing
Computer VisionMachine Learning
J
Jing Li
StepFun
Shuchang Zhou
Shuchang Zhou
Megvii Inc.
Artificial Intelligence
X
Xiangyu Zhang
StepFun
Xinhao Zhang
Xinhao Zhang
PHD student, Portland State University
Data MiningReinforcement Learning
Y
Yibo Zhu
StepFun
H
H. Shum
StepFun
Daxin Jiang
Daxin Jiang
Co-Founder & CEO, StepFun Corporation
Deep LearningFoundation Models