Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitively high training cost of video generation foundation models, this paper proposes an efficient training paradigm under resource constraints: training a medium-scale 7B-parameter diffusion model from scratch using only 665,000 H100 GPU hours. Methodologically, the authors introduce a lightweight spatiotemporal modeling architecture, a progressive curriculum learning strategy, and a low-overhead fine-tuning/resumption mechanism. The core contribution is empirical validation of the "medium model superiority" hypothesis: the 7B model matches or surpasses substantially larger competitors on multiple video generation benchmarks, while exhibiting strong cross-task generalization and rapid adaptation. This design significantly lowers deployment barriers and computational overhead for downstream applications, offering a scalable and practical alternative to parameter-inefficient large models.

📝 Abstract
This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B, trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications by either lightweight fine-tuning or continued training. See the project page at https://seaweed.video/
Problem

Research questions and friction points this paper is trying to address.

Cost-efficient training of a video generation foundation model
Achieving competitive performance with moderate computational resources
Strong generalization to a wide range of downstream applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cost-efficient 7B-parameter video generation model
Trained from scratch with 665,000 H100 GPU hours
Strong generalization via lightweight fine-tuning or continued training
Team Seawead
ByteDance
Ceyuan Yang
The Chinese University of Hong Kong
Computer Vision
Zhijie Lin
ByteDance Inc.
Machine Learning
Yang Zhao
ByteDance
Shanchuan Lin
ByteDance
Computer Science
Zhibei Ma
University of Southern California
Robotics · Artificial Intelligence · AIGC · Machine Learning
Haoyuan Guo
ByteDance
Hao Chen
ByteDance
Lu Qi
Insta360 | Wuhan University
Computer Vision · Deep Learning
Sen Wang
ByteDance
Feng Cheng
ByteDance
Feilong Zuo
ByteDance
Xuejiao Zeng
ByteDance
Ziyan Yang
ByteDance Seed
Computer Vision · Natural Language Processing
Fangyuan Kong
ByteDance
Zhiwu Qing
Huazhong University of Science and Technology
Video Understanding
Fei Xiao
ByteDance
Meng Wei
ByteDance
Tuyen Hoang
ByteDance
Siyu Zhang
4DV.ai
Computer Vision
Peihao Zhu
ByteDance Seed | KAUST
Computer Vision · Computer Graphics · Deep Learning
Qi Zhao
ByteDance
Jiangqiao Yan
ByteDance
Liangke Gui
Google DeepMind
Computer Vision · Machine Learning
Sheng Bi
Dalian University of Technology
Semiconductor · Organic Electronics
Jiashi Li
ByteDance Inc.
Image/Video Generation · Train/Infer Infra
Yuxi Ren
ByteDance
Rui Wang
ByteDance
Huixia Li
ByteDance
Xuefeng Xiao
ByteDance Seed
Computer Vision · Efficient AI
Shu Liu
ByteDance
Feng Ling
ByteDance
Heng Zhang
ByteDance
Houmin Wei
ByteDance
Huafeng Kuang
ByteDance Inc.
Multimodal Understanding and Generation · Adversarial Robustness
Jerry Duncan
ByteDance
Junda Zhang
ByteDance
Junru Zheng
ByteDance
Li Sun
ByteDance
Manlin Zhang
ByteDance
Renfei Sun
ByteDance
Xiaobin Zhuang
ByteDance
Audio Generation
Xiaojie Li
ByteDance
Xin Xia
ByteDance
Xuyan Chi
ByteDance
Yanghua Peng
ByteDance Inc.
Large Language Models · Machine Learning Systems · GPU Scheduling
Yuping Wang
ByteDance
Yuxuan Wang
ByteDance
Zhongkai Zhao
ByteDance
Machine Learning Systems · LLM · Software Engineering
Zhuo Chen
ByteDance
Zuquan Song
ByteDance
Zhenheng Yang
TikTok
Computer Vision · Machine Learning · Deep Learning
Jiashi Feng
ByteDance Inc.
Computer Vision · Machine Learning
Jianchao Yang
ByteDance
Lu Jiang
Research Scientist @ Apple
Generative AI · Foundation Model · Robust Deep Learning · Multimedia · Video Generation