Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

📅 2025-03-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the text-guided image-to-video (TI2V) generation task by proposing Step-Video-TI2Vβ€”a 30B-parameter multimodal diffusion model supporting joint text-image conditioning and high-fidelity long-duration video synthesis (up to 102 frames). Methodologically, it employs a large-scale multimodal Transformer architecture that unifies cross-modal text-image representation learning with a temporally extended video diffusion process. Key contributions include: (1) introducing Step-Video-TI2V-Evalβ€”the first dedicated benchmark for TI2V evaluation; (2) open-sourcing the model and conducting comprehensive, systematic comparisons across leading open-source and commercial TI2V approaches; and (3) achieving state-of-the-art performance on Step-Video-TI2V-Eval, with significant improvements in visual fidelity, temporal coherence, and semantic alignment between input conditions and generated videos.
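The report itself specifies the architecture; as a rough illustration of what joint text-image conditioning in a video diffusion Transformer can look like, the sketch below prepends the conditioning-image latent to the noisy video tokens and injects the text prompt via cross-attention. All module names, shapes, and the conditioning scheme are illustrative assumptions, not the actual Step-Video-TI2V design.

import torch
import torch.nn as nn

class TI2VDenoiserSketch(nn.Module):
    """Toy denoiser block: image latent prepended as condition tokens, text injected via cross-attention."""
    def __init__(self, dim=256, heads=4, text_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)   # project prompt embeddings to model width
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_video_tokens, image_tokens, text_emb, t):
        # noisy_video_tokens: (B, F*P, D) flattened spatio-temporal video latent tokens
        # image_tokens:       (B, P, D)   clean latent tokens of the conditioning image
        # text_emb:           (B, L, text_dim) prompt embeddings from a text encoder
        # t:                  (B, 1)      diffusion timestep
        x = torch.cat([image_tokens, noisy_video_tokens], dim=1)  # joint image-video token sequence
        x = x + self.time_mlp(t).unsqueeze(1)                     # timestep conditioning
        x = x + self.self_attn(x, x, x)[0]                        # spatio-temporal self-attention
        txt = self.text_proj(text_emb)
        x = x + self.cross_attn(x, txt, txt)[0]                   # text conditioning via cross-attention
        x = x + self.ffn(x)
        return self.out(x[:, image_tokens.shape[1]:])             # predict noise for the video tokens only

# Smoke test with toy sizes (the real model has 30B parameters and generates up to 102 frames).
B, frames, patches, dim, text_len = 1, 4, 16, 256, 8
block = TI2VDenoiserSketch(dim=dim)
eps_hat = block(torch.randn(B, frames * patches, dim),
                torch.randn(B, patches, dim),
                torch.randn(B, text_len, 512),
                torch.rand(B, 1))
print(eps_hat.shape)  # torch.Size([1, 64, 256])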


📝 Abstract
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
Problem

Research questions and friction points this paper is trying to address.

How to generate coherent videos conditioned jointly on a text prompt and a reference image.
The lack of a dedicated benchmark for evaluating text-driven image-to-video (TI2V) generation.
How the proposed model compares against existing open-source and commercial TI2V engines.
Innovation

Methods, ideas, or system contributions that make the work stand out.

30B-parameter text-driven image-to-video (TI2V) model
Generates videos of up to 102 frames conditioned on both text and image inputs
Introduces the Step-Video-TI2V-Eval benchmark for evaluation (a comparison-loop sketch follows this list)
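As a rough illustration of how such a benchmark comparison could be organized, the sketch below loops over (image, prompt) pairs and averages a per-video score for each engine. The EvalItem fields, the generate(image, prompt) interface, and the scoring function are hypothetical placeholders, not the actual Step-Video-TI2V-Eval protocol or metrics.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EvalItem:
    image_path: str   # conditioning image
    prompt: str       # text prompt describing the desired motion / content

def compare_engines(items: List[EvalItem],
                    engines: Dict[str, Callable[[str, str], str]],
                    score_fn: Callable[[str, EvalItem], float]) -> Dict[str, float]:
    """Return each engine's mean score over the benchmark items."""
    results = {}
    for name, generate in engines.items():
        scores = [score_fn(generate(it.image_path, it.prompt), it) for it in items]
        results[name] = sum(scores) / len(scores)
    return results

# Toy usage with stand-in engines and a constant metric.
items = [EvalItem("cat.png", "the cat slowly turns its head toward the camera")]
engines = {"step-video-ti2v": lambda img, txt: "out_a.mp4",
           "baseline": lambda img, txt: "out_b.mp4"}
print(compare_engines(items, engines, score_fn=lambda video_path, item: 1.0))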
Authors

Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun, Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Hui Xiong, Jiaxin He, Jianchang Wu, Jianlong Yuan, Jie Wu, Jiashuai Liu, Junjing Guo, Kaijun Tan, Liangyu Chen, Qiaohui Chen, Ran Sun, Shanshan Yuan, Shengming Yin, Sitong Liu, Wei Chen, Yaqi Dai, Yuchu Luo, Zheng Ge, Zhi-Ying Guan, Xiaoniu Song, Yu Zhou, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Yi Xiu, Yibo Zhu, H. Shum, Daxin Jiang