AI Summary
This work addresses the text-guided image-to-video (TI2V) generation task with Step-Video-TI2V, a 30B-parameter multimodal diffusion model that supports joint text-image conditioning and synthesizes high-fidelity videos of up to 102 frames. Methodologically, it employs a large-scale multimodal Transformer that unifies cross-modal text-image representation learning with a temporally extended video diffusion process. Key contributions include: (1) Step-Video-TI2V-Eval, a new dedicated benchmark for TI2V evaluation; (2) an open-sourced model together with comprehensive, systematic comparisons against leading open-source and commercial TI2V approaches; and (3) state-of-the-art performance on Step-Video-TI2V-Eval, with significant improvements in visual fidelity, temporal coherence, and semantic alignment between the input conditions and the generated videos.
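To make the joint text-image conditioning concrete, below is a minimal, self-contained sketch of a generic TI2V denoising loop in PyTorch. Everything in it (`denoiser`, `STEPS`, the toy latent sizes, and the crude Euler-style update) is a hypothetical placeholder chosen for illustration; it is not the Step-Video-TI2V architecture or API, only the general recipe of anchoring the first latent frame on the input image while denoising under both conditions.

```python
import torch

# Hypothetical placeholders: none of these names or sizes come from the
# Step-Video-TI2V codebase; this only illustrates the generic TI2V recipe.
T, C, H, W = 102, 4, 8, 8   # frames x latent channels x spatial grid (toy sizes)
STEPS = 10                  # toy number of denoising steps

def denoiser(latents, step, text_emb, image_latent):
    # Stand-in for the large Transformer: predicts noise from the noisy
    # latents plus both conditions (here just a toy function of its inputs).
    return 0.1 * latents + 0.01 * (text_emb.mean() + image_latent.mean())

def sample_ti2v(text_emb, image_latent):
    latents = torch.randn(1, T, C, H, W)
    latents[:, 0] = image_latent                # anchor frame 0 on the input image
    for step in range(STEPS):
        noise_pred = denoiser(latents, step, text_emb, image_latent)
        latents = latents - noise_pred / STEPS  # crude Euler-style update
        latents[:, 0] = image_latent            # keep the conditioning frame fixed
    return latents

video_latents = sample_ti2v(torch.randn(77, 768), torch.randn(C, H, W))
print(video_latents.shape)  # torch.Size([1, 102, 4, 8, 8])
```

Re-imposing the image latent on the first frame after each step is one common way to enforce the image condition; other designs inject it through cross-attention or channel concatenation instead.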
Abstract
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames conditioned on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines on this benchmark. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V on the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.