GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning

📅 2025-06-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Fine-grained optimization of video diffusion models typically relies on costly human annotations and substantial computational resources. Method: The paper proposes GigaVideo-1, a lightweight fine-tuning framework driven by automatic feedback that requires no human annotations and only about 4 GPU-hours. It pairs a weakness-guided, prompt-driven data engine with an adaptive sample-weighting mechanism that combines rewards from a pre-trained vision-language model (VLM) with a realism constraint. The approach brings together video diffusion fine-tuning, VLM-based discrimination, reward modeling, and synthetic data distillation. Contribution/Results: Evaluated on VBench-2.0 with Wan2.1 as the baseline, the method achieves an average improvement of about 4% across 17 evaluation dimensions, without human annotations and with minimal reliance on real-world data. The authors state that the code, models, and data will be publicly released.

📝 Abstract

Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
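The abstract does not spell out how samples are weighted, so the following is only a minimal sketch of one way a reward-guided, realism-constrained weighting could look. Everything in it is an assumption made for illustration: the function names `adaptive_sample_weights` and `weighted_loss`, the softmax form, the `temperature` and `realism_floor` parameters, and the idea that rewards and realism scores lie in [0, 1]. It is not the authors' implementation.

```python
import math
from typing import List, Sequence


def adaptive_sample_weights(
    vlm_rewards: Sequence[float],
    realism_scores: Sequence[float],
    temperature: float = 0.5,    # hypothetical: lower values sharpen the weighting
    realism_floor: float = 0.3,  # hypothetical: samples below this are ignored
) -> List[float]:
    """Turn per-sample VLM rewards into normalized training weights.

    Assumption: a softmax over rewards, with samples that fail a realism
    check gated out (weight 0). The paper may use a different scheme.
    """
    if len(vlm_rewards) != len(realism_scores):
        raise ValueError("rewards and realism scores must have the same length")
    if not vlm_rewards:
        return []

    # Numerically stable softmax numerators.
    scaled = [r / temperature for r in vlm_rewards]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]

    # Realism constraint: drop samples judged implausible.
    gated = [e if s >= realism_floor else 0.0
             for e, s in zip(exps, realism_scores)]

    total = sum(gated)
    if total == 0.0:
        return [0.0] * len(gated)
    return [g / total for g in gated]


def weighted_loss(per_sample_losses: Sequence[float],
                  weights: Sequence[float]) -> float:
    """Weighted sum of per-sample losses (e.g. diffusion denoising losses)."""
    return sum(w * l for w, l in zip(weights, per_sample_losses))


if __name__ == "__main__":
    # Placeholder numbers, not results from the paper.
    w = adaptive_sample_weights([0.9, 0.2, 0.8], [0.9, 0.8, 0.1])
    print(w, weighted_loss([1.0, 2.0, 3.0], w))
```

In this sketch the third sample has a high reward but fails the realism check, so it contributes nothing to the loss; the remaining weight is split between the first two samples in favor of the higher reward.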

Problem

Research questions and friction points this paper is trying to address.

Enhancing video generation quality without human annotations

Reducing computational resources for fine-tuning video models

Improving specific video dimensions via automatic feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic feedback for video generation fine-tuning

Prompt-driven data engine for diverse samples

Reward-guided training with realism constraints
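To make the idea of a weakness-oriented data engine concrete, here is a small sketch that splits a prompt budget across evaluation dimensions in proportion to how weak the baseline is on each. The allocation rule, the function name `allocate_prompt_budget`, and the example scores are all hypothetical; the source only says that the engine constructs diverse, weakness-oriented training samples.

```python
from typing import Dict


def allocate_prompt_budget(scores: Dict[str, float],
                           total_prompts: int) -> Dict[str, int]:
    """Split a prompt budget across dimensions, favoring weak ones.

    Assumption: scores lie in [0, 1] and weakness is defined as 1 - score.
    Fractional shares are resolved with the largest-remainder method so the
    allocation always sums to total_prompts.
    """
    if not scores:
        return {}

    weakness = {d: max(0.0, 1.0 - s) for d, s in scores.items()}
    total = sum(weakness.values())
    if total == 0.0:
        # No measurable weakness: fall back to an even split.
        weakness = {d: 1.0 for d in scores}
        total = float(len(scores))

    raw = {d: total_prompts * w / total for d, w in weakness.items()}
    budget = {d: int(r) for d, r in raw.items()}

    # Hand out prompts lost to rounding, largest fractional part first.
    leftover = total_prompts - sum(budget.values())
    by_remainder = sorted(raw, key=lambda d: raw[d] - budget[d], reverse=True)
    for d in by_remainder[:leftover]:
        budget[d] += 1
    return budget


if __name__ == "__main__":
    # Placeholder baseline scores, not numbers reported in the paper.
    baseline = {
        "instance preservation": 0.80,
        "motion rationality": 0.40,
        "composition": 0.60,
        "physical plausibility": 0.30,
    }
    print(allocate_prompt_budget(baseline, total_prompts=100))
```

Proportional allocation is only one possible policy; a real engine might also cap the share of any single dimension to keep the training prompts diverse.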

🔎 Similar Papers

Pyramidal Flow Matching for Efficient Video Generative Modeling