GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained optimization of video diffusion models typically relies on costly human annotations and substantial computational resources. Method: This paper proposes GigaVideo-1, an automatic, feedback-driven lightweight fine-tuning framework that requires no human annotations and only about 4 GPU-hours. It introduces the first unsupervised automatic-feedback paradigm, built on two components: a weakness-guided, prompt-driven data engine and an adaptive sample-weighting mechanism that integrates vision-language model (VLM)-based rewards with a physical-plausibility constraint. The approach combines video diffusion fine-tuning, VLM-based reward modeling, and self-generated synthetic training data. Contribution/Results: Evaluated on VBench-2.0 with Wan2.1 as the baseline, the method achieves an average improvement of about 4% across 17 quality dimensions, with no human annotations and minimal reliance on real-world data. Code, models, and datasets will be publicly released.
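
The adaptive sample weighting can be pictured with a minimal sketch. This is not the paper's exact formulation: gating the VLM reward by a realism score, the softmax normalization over the batch, and the temperature `tau` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_sample_weights(vlm_rewards: torch.Tensor,
                            realism_scores: torch.Tensor,
                            tau: float = 0.5) -> torch.Tensor:
    """Combine VLM-based rewards with a physical-plausibility (realism)
    constraint into per-sample training weights.

    Hypothetical formulation: the realism score gates the VLM reward, so a
    physically implausible sample is down-weighted even when the VLM rates
    it highly. Both inputs are assumed to lie in [0, 1].
    """
    gated = vlm_rewards * realism_scores
    return F.softmax(gated / tau, dim=0)  # normalize weights over the batch

def reward_weighted_diffusion_loss(pred_noise: torch.Tensor,
                                   true_noise: torch.Tensor,
                                   weights: torch.Tensor) -> torch.Tensor:
    """Standard noise-prediction MSE, re-weighted by the feedback signal."""
    per_sample = F.mse_loss(pred_noise, true_noise, reduction="none")
    per_sample = per_sample.flatten(1).mean(dim=1)  # one scalar per video
    return (weights * per_sample).sum()
```

In this reading, the weighting simply rescales the usual diffusion objective per sample, so strong samples (high reward, high realism) dominate each gradient step without any change to the model architecture.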

📝 Abstract
Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
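
To make the prompt-driven data engine concrete, here is a hedged sketch of a weakness-guided data loop. The evaluator and LLM interfaces (`score_dimension`, `score_video`, `generate_prompts`) and the 0.6 weakness threshold are hypothetical placeholders, not the paper's actual pipeline.

```python
# Hypothetical weakness-guided data engine: find the base model's weak
# VBench-2.0 dimensions, then synthesize targeted training samples with
# automatic (VLM-based) feedback attached -- no human annotation involved.

WEAK_THRESHOLD = 0.6  # assumed cutoff below which a dimension counts as weak

def build_training_set(model, evaluator, prompt_llm, dims, n_per_dim=64):
    samples = []
    for dim in dims:
        if evaluator.score_dimension(model, dim) >= WEAK_THRESHOLD:
            continue  # dimension already strong; no targeted data needed
        # Ask an LLM for prompts that stress this specific weakness,
        # e.g. "motion rationality" or "physical plausibility".
        prompts = prompt_llm.generate_prompts(dimension=dim, n=n_per_dim)
        for p in prompts:
            video = model.sample(prompt=p)              # self-generated data
            reward = evaluator.score_video(video, dim)  # automatic feedback
            samples.append({"video": video, "prompt": p,
                            "dimension": dim, "reward": reward})
    return samples
```

The per-sample rewards collected here would then feed the adaptive weighting sketched above, closing the loop between data construction and optimization.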
Problem

Research questions and friction points this paper is trying to address.

Enhancing video generation quality without human annotations
Reducing computational resources for fine-tuning video models
Improving specific video dimensions via automatic feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic feedback for video generation fine-tuning
Prompt-driven data engine for diverse samples
Reward-guided training with realism constraints
Authors

Xiaoyi Bao
GigaAI; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Jindi Lv
Sichuan University
Interests: deep learning, neural architecture search, multimodal

Xiaofeng Wang
GigaAI

Zheng Zhu
GigaAI

Xinze Chen
Unknown affiliation

Yukun Zhou
GigaAI

Jiancheng Lv
University of Science and Technology of China
Interests: Operations Management, Marketing

Xingang Wang
Institute of Automation, Chinese Academy of Sciences

Guan Huang
GigaAI