Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitively high training cost of video generation foundation models, this paper proposes an efficient training paradigm under resource constraints: training a medium-scale 7B-parameter diffusion model from scratch using only 665,000 H100 GPU hours. Methodologically, the authors introduce a lightweight spatiotemporal modeling architecture, a progressive curriculum learning strategy, and a low-overhead fine-tuning/resumption mechanism. The core contribution is empirical validation of the "medium model superiority" hypothesis: the 7B model matches or surpasses substantially larger competitors on multiple video generation benchmarks, while exhibiting strong cross-task generalization and rapid adaptation. This design significantly lowers deployment barriers and computational overhead for downstream applications, offering a scalable and practical alternative to parameter-inefficient large models.

📝 Abstract
This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B, trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications by either lightweight fine-tuning or continued training. See the project page at https://seaweed.video/
Problem

Research questions and friction points this paper is trying to address.

Cost-efficient training of a video generation foundation model
Achieving competitive performance with moderate computational resources
Strong generalization to a wide range of downstream applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cost-efficient 7B-parameter video generation model
Trained from scratch with 665,000 H100 GPU hours
Strong generalization via lightweight fine-tuning or continued training
Team Seawead
ByteDance
Ceyuan Yang
The Chinese University of Hong Kong
Computer Vision
Zhijie Lin
ByteDance Inc.
Machine Learning
Yang Zhao
ByteDance
Shanchuan Lin
ByteDance
Computer Science
Zhibei Ma
University of Southern California
Robotics · Artificial Intelligence · AIGC · Machine Learning
Haoyuan Guo
ByteDance
Hao Chen
ByteDance
Lu Qi
Insta360 | Wuhan University
Computer Vision · Deep Learning
Sen Wang
ByteDance
Feng Cheng
ByteDance
Feilong Zuo
ByteDance
Xuejiao Zeng
ByteDance
Ziyan Yang
ByteDance Seed
Computer Vision · Natural Language Processing
Fangyuan Kong
ByteDance
Zhiwu Qing
Huazhong University of Science and Technology
Video Understanding
Fei Xiao
ByteDance
Meng Wei
ByteDance
Tuyen Hoang
ByteDance
Siyu Zhang
4DV.ai
Computer Vision
Peihao Zhu
ByteDance Seed | KAUST
Computer Vision · Computer Graphics · Deep Learning
Qi Zhao
ByteDance
Jiangqiao Yan
ByteDance
Liangke Gui
Google DeepMind
Computer Vision · Machine Learning
Sheng Bi
Dalian University of Technology
Semiconductor · Organic Electronics
Jiashi Li
ByteDance Inc.
Image/Video Generation · Train/Infer Infra
Yuxi Ren
ByteDance
Rui Wang
ByteDance
Huixia Li
ByteDance
Xuefeng Xiao
ByteDance Seed
Computer Vision · Efficient AI
Shu Liu
ByteDance
Feng Ling
ByteDance
Heng Zhang
ByteDance
Houmin Wei
ByteDance
Huafeng Kuang
ByteDance Inc.
Multimodal Understanding and Generation · Adversarial Robustness
Jerry Duncan
ByteDance
Junda Zhang
ByteDance
Junru Zheng
ByteDance
Li Sun
ByteDance
Manlin Zhang
ByteDance
Renfei Sun
ByteDance
Xiaobin Zhuang
ByteDance
Audio Generation
Xiaojie Li
ByteDance
Xin Xia
ByteDance
Xuyan Chi
ByteDance
Yanghua Peng
ByteDance Inc.
Large Language Models · Machine Learning Systems · GPU Scheduling
Yuping Wang
ByteDance
Yuxuan Wang
ByteDance
Zhongkai Zhao
ByteDance
Machine Learning Systems · LLM · Software Engineering
Zhuo Chen
ByteDance
Zuquan Song
ByteDance
Zhenheng Yang
TikTok
Computer Vision · Machine Learning · Deep Learning
Jiashi Feng
ByteDance Inc.
Computer Vision · Machine Learning
Jianchao Yang
ByteDance
Lu Jiang
Research Scientist @ Apple
Generative AI · Foundation Model · Robust Deep Learning · Multimedia · Video Generation