🤖 AI Summary
Fine-grained optimization of video diffusion models typically relies on costly human annotations and substantial computational resources.
Method: This paper proposes an automatic, feedback-driven lightweight fine-tuning framework that requires zero human annotations and only ~4 GPU-hours. It introduces the first unsupervised automatic feedback paradigm, featuring a weakness-guided, prompt-driven data engine and an adaptive sample weighting mechanism integrating vision-language model (VLM)-based rewards with physical plausibility constraints. The approach synergistically combines video diffusion fine-tuning, VLM-based discrimination, reward modeling, and synthetic data distillation.
Contribution/Results: Evaluated on VBench-2.0 using the Wan2.1 baseline, our method achieves an average improvement of ~4% across 17 quality dimensions—without any human annotations and with minimal reliance on real-world data. The code, models, and datasets will be publicly released.
📝 Abstract
Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.