ContentV: Efficient Training of Video Generation Models with Limited Compute

2025-06-05
Citations: 0
Influential: 0

AI Summary
To address the prohibitively high training costs of video generation models, this paper introduces ContentV, a text-to-video diffusion model with 8 billion parameters. Methodologically, ContentV features: (1) a minimal temporal extension architecture that reuses a pre-trained image diffusion model, eliminating the need to train from scratch; (2) a multi-stage curriculum training strategy based on flow matching to accelerate convergence; and (3) a human-annotation-free RLHF framework for quality optimization, leveraging synthetic feedback to enhance video fidelity and temporal consistency. Trained exclusively on 256 x 64GB NPUs for four weeks, ContentV achieves a state-of-the-art VBench score of 85.14 at a substantially reduced computational cost. The model weights and training code are fully open-sourced.

Abstract
Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.
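The compute budget quoted above (256 NPUs with 64GB each, four weeks) can be turned into device-hours with simple arithmetic. The sketch below is only an upper bound: it assumes every device runs continuously for the full 28 days, which the abstract does not state, and the function names are illustrative.

```python
# Rough upper bound on the training budget quoted in the abstract.
NUM_NPUS = 256           # from the abstract
MEMORY_PER_NPU_GB = 64   # from the abstract
TRAINING_WEEKS = 4       # from the abstract


def npu_hours(num_devices: int, weeks: int) -> int:
    """Device-hours, ASSUMING all devices are busy 24/7 for the whole period."""
    return num_devices * weeks * 7 * 24


def aggregate_memory_gb(num_devices: int, memory_per_device_gb: int) -> int:
    """Total accelerator memory across the cluster, in GB."""
    return num_devices * memory_per_device_gb


budget_hours = npu_hours(NUM_NPUS, TRAINING_WEEKS)
total_memory_gb = aggregate_memory_gb(NUM_NPUS, MEMORY_PER_NPU_GB)

print(f"Upper-bound budget: {budget_hours:,} NPU-hours")    # 172,032
print(f"Aggregate device memory: {total_memory_gb:,} GB")   # 16,384
```

Real utilization is lower than this bound once restarts, evaluation, and data loading are accounted for.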
Problem

Research questions and friction points this paper is trying to address.

Efficient training of video generation models with limited compute
Achieving high-quality video generation from text prompts
Reducing computational costs without sacrificing performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reuse pre-trained image models for video
Multi-stage training with flow matching
Cost-effective RLHF without extra annotations
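The list above names flow matching without detail. As a minimal sketch, the common straight-path (rectified-flow style) objective can be written in a few lines of plain Python. This is a generic illustration, not necessarily the formulation ContentV uses: the time convention (noise at t=0, data at t=1), the function names, and the toy predictor are all assumptions.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Velocity of the straight path; it does not depend on t."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, data, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity is a hypothetical stand-in for the model: it takes
    (x_t, t) and returns a list with the same length as x_t.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, data, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, data)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [0.5, -1.0, 2.0]  # toy "data" vector, not real video latents

    def zero_model(x_t, t):
        # Untrained placeholder that always predicts zero velocity.
        return [0.0] * len(x_t)

    print("loss with zero predictor:", flow_matching_loss(zero_model, sample, rng))
```

In practice the predictor is a neural network and the loss is averaged over a batch of latents and sampled timesteps; the scalar version here only shows the structure of the objective.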
Wenfeng Lin
ByteDance Douyin Content Group
Renjie Chen
ByteDance Douyin Content Group
Boyuan Liu
ByteDance Douyin Content Group
Shiyue Yan
ByteDance Douyin Content Group
Ruoyu Feng
University of Science and Technology of China
Generative Models, Computer Vision, Image/Video Coding for Machine
Jiangchuan Wei
ByteDance Douyin Content Group
Yichen Zhang
ByteDance Douyin Content Group
Yimeng Zhou
ByteDance Douyin Content Group
Chao Feng
University of Zurich
network, machine learning, cybersecurity
Jiao Ran
ByteDance Douyin Content Group
Qi Wu
ByteDance Douyin Content Group
Zuotao Liu
ByteDance Douyin Content Group
Mingyu Guo
ByteDance Douyin Content Group