AI Summary
This work addresses the text-guided image-to-video (TI2V) generation task with Step-Video-TI2V, a 30B-parameter multimodal diffusion model that supports joint text-image conditioning and synthesizes high-fidelity videos of up to 102 frames. Methodologically, it employs a large-scale multimodal Transformer that unifies cross-modal text-image representation learning with a temporally extended video diffusion process. Key contributions include: (1) Step-Video-TI2V-Eval, a new dedicated benchmark for TI2V evaluation; (2) an open-sourced model together with comprehensive, systematic comparisons against leading open-source and commercial TI2V approaches; and (3) state-of-the-art performance on Step-Video-TI2V-Eval, with significant improvements in visual fidelity, temporal coherence, and semantic alignment between the input conditions and the generated videos.
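To make the joint text-image conditioning concrete, below is a minimal, self-contained sketch of a generic TI2V denoising loop in PyTorch. Everything in it (`denoiser`, `STEPS`, the toy latent sizes, and the crude Euler-style update) is a hypothetical placeholder chosen for illustration; it is not the Step-Video-TI2V architecture or API, only the general recipe of anchoring the first latent frame on the input image while denoising under both conditions.

```python
import torch

# Hypothetical placeholders: none of these names or sizes come from the
# Step-Video-TI2V codebase; this only illustrates the generic TI2V recipe.
T, C, H, W = 102, 4, 8, 8   # frames x latent channels x spatial grid (toy sizes)
STEPS = 10                  # toy number of denoising steps

def denoiser(latents, step, text_emb, image_latent):
    # Stand-in for the large Transformer: predicts noise from the noisy
    # latents plus both conditions (here just a toy function of its inputs).
    return 0.1 * latents + 0.01 * (text_emb.mean() + image_latent.mean())

def sample_ti2v(text_emb, image_latent):
    latents = torch.randn(1, T, C, H, W)
    latents[:, 0] = image_latent                # anchor frame 0 on the input image
    for step in range(STEPS):
        noise_pred = denoiser(latents, step, text_emb, image_latent)
        latents = latents - noise_pred / STEPS  # crude Euler-style update
        latents[:, 0] = image_latent            # keep the conditioning frame fixed
    return latents

video_latents = sample_ti2v(torch.randn(77, 768), torch.randn(C, H, W))
print(video_latents.shape)  # torch.Size([1, 102, 4, 8, 8])
```

Re-imposing the image latent on the first frame after each step is one common way to enforce the image condition; other designs inject it through cross-attention or channel concatenation instead.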
Abstract
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames conditioned on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines on this benchmark. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V on the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.