Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

📅 2025-02-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

We present Step-Video-T2V, a bilingual (Chinese-English) text-to-video (T2V) foundation model with 30B parameters that generates videos of up to 204 frames. The system combines three components: (1) Video-VAE, a deep-compression VAE designed for video generation that achieves 16×16 spatial and 8× temporal compression; (2) a DiT with 3D full attention, trained with Flow Matching; and (3) Video-DPO, a video-based direct preference optimization step that reduces artifacts and improves the visual quality of generated videos. We also introduce Step-Video-T2V-Eval, a new T2V benchmark, on which Step-Video-T2V shows state-of-the-art quality compared with both open-source and commercial engines. The model and the benchmark are publicly available.
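
To make the compression ratios concrete, the following minimal sketch computes the latent grid that 16×16 spatial and 8× temporal compression would produce. The function names, the 512×768 resolution, and the use of ceiling division for sizes that are not exact multiples are assumptions for illustration; the summary does not say how Video-VAE handles such sizes, and channel counts are ignored.

```python
import math


def latent_shape(frames, height, width, t_ratio=8, s_ratio=16):
    """Return (latent_frames, latent_height, latent_width).

    Ceiling division is an assumption for sizes that are not exact
    multiples of the compression ratios.
    """
    return (
        math.ceil(frames / t_ratio),
        math.ceil(height / s_ratio),
        math.ceil(width / s_ratio),
    )


def compression_factor(t_ratio=8, s_ratio=16):
    """Number of input pixel positions covered by one latent position."""
    return t_ratio * s_ratio * s_ratio


# Hypothetical clip: 204 frames at 512x768 (the resolution is an assumption).
print(latent_shape(204, 512, 768))   # (26, 32, 48)
print(compression_factor())          # 2048
```

Under these assumptions each latent position stands in for 2048 pixel positions, which is what keeps 3D full attention over a 204-frame clip tractable.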

📝 Abstract

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
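
The abstract names Flow Matching as the training objective without giving its form. The sketch below shows one common variant (a straight-line path between noise and data with a velocity-regression loss) in NumPy. The linear path, the uniform timestep sampling, and the name `flow_matching_loss` are assumptions for illustration, not the paper's confirmed formulation; `model` is a hypothetical stand-in for the DiT.

```python
import numpy as np


def flow_matching_loss(model, x1, rng):
    """Mean-squared error between predicted and target velocity.

    x1:    batch of clean latents, shape (batch, ...).
    model: callable (x_t, t) -> predicted velocity with the shape of x_t.
    """
    x0 = rng.standard_normal(x1.shape)                 # Gaussian noise sample
    t_shape = (x1.shape[0],) + (1,) * (x1.ndim - 1)    # one timestep per sample
    t = rng.uniform(size=t_shape)
    x_t = (1.0 - t) * x0 + t * x1                      # point on the straight path
    target = x1 - x0                                   # constant velocity of that path
    pred = model(x_t, t)
    return float(np.mean((pred - target) ** 2))


# Usage with a placeholder model that always predicts zero velocity.
rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 8, 16, 16))          # hypothetical latent batch
zero_model = lambda x_t, t: np.zeros_like(x_t)
print(flow_matching_loss(zero_model, latents, rng))
```

At inference time a model trained this way is integrated from noise toward data, which matches the abstract's description of denoising input noise into latent frames.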

Problem

Research questions and friction points this paper is trying to address.

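Develop a state-of-the-art text-to-video generation model that supports English and Chinese prompts

Compress video deeply (16x16 spatial, 8x temporal) while maintaining reconstruction quality

Identify the limitations of the current diffusion-based paradigm for video foundation models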

Innovation

Methods, ideas, or system contributions that make the work stand out.

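Video-VAE: a deep compression Variational Autoencoder with 16x16 spatial and 8x temporal compression

Two bilingual text encoders for English and Chinese prompts

Video-DPO to reduce artifacts and improve visual quality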
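
For readers unfamiliar with DPO, the following minimal sketch shows the generic direct preference optimization loss over pairs of preferred and rejected samples. It is the standard DPO objective, not the paper's Video-DPO: this page does not describe how Video-DPO scores videos, and diffusion-style models usually substitute a denoising-error surrogate for the log-likelihoods used here. The function name and the `beta` value are assumptions for illustration.

```python
import numpy as np


def dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """Generic DPO loss averaged over a batch of preference pairs.

    Each argument is an array of log-likelihoods, shape (batch,):
    the trained policy and a frozen reference model, each evaluated
    on the preferred ("win") and rejected ("lose") sample.
    """
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))


# A policy identical to the reference gives zero margin, so loss = log(2).
zeros = np.zeros(3)
print(dpo_loss(zeros, zeros, zeros, zeros))
```

The loss falls below log(2) as the policy assigns relatively more likelihood to preferred samples than the reference does.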

🔎 Similar Papers

From Image to Video, what do we need in multimodal LLMs?

2024-04-18 | arXiv.org | Citations: 8