HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment

📅 2025-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-video (T2V) generation suffers from misalignment with human preferences and limited high-quality paired preference data, constraining model performance. Method: We introduce Direct Preference Optimization (DPO) to T2V for the first time. To address data scarcity, we construct the first small-scale, action-category-level video preference dataset. We propose a novel framework integrating first-frame conditioning with sparse causal attention to enable temporally controllable and computationally efficient generation. Our method includes a T2V-specific DPO formulation, a fine-grained preference data construction paradigm, and a first-frame-guided diffusion model fine-tuning strategy. Contribution/Results: Experiments demonstrate significant improvements in video aesthetic quality, human preference alignment, temporal coherence, and generation flexibility under limited annotations, while reducing training cost. This work establishes a foundation for preference-driven T2V learning.

📝 Abstract
With the rapid development of AIGC technology, significant progress has been made in diffusion-model-based text-to-image (T2I) and text-to-video (T2V) generation. In recent years, a few studies have introduced the strategy of Direct Preference Optimization (DPO) into T2I tasks, significantly improving the alignment of generated images with human preferences. However, existing T2V generation methods lack a well-formed pipeline with an exact loss function to guide the alignment of generated videos with human preferences using DPO strategies. Additionally, the scarcity of paired video preference data hinders effective model training, and the lack of training datasets risks insufficient flexibility and poor quality in the generated videos. To address these problems, our work proposes three targeted solutions. 1) Our work is the first to introduce the DPO strategy into T2V tasks. By deriving a carefully structured loss function, we utilize human feedback to align video generation with human preferences. We refer to this new method as HuViDPO. 2) Our work constructs small-scale human preference datasets for each action category and fine-tunes the model on them, improving the aesthetic quality of the generated videos while reducing training costs. 3) We adopt a First-Frame-Conditioned strategy, leveraging the rich information in the first frame to guide the generation of subsequent frames and enhancing flexibility in video generation. At the same time, we employ a Sparse Causal Attention mechanism to enhance the quality of the generated videos. More details and examples can be accessed on our website: https://tankowa.github.io/HuViDPO.github.io/.
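The DPO loss for T2V sketched below follows the general Diffusion-DPO recipe, which the abstract's "carefully structured loss function" builds on: compare how much better the fine-tuned model denoises the preferred (winning) video than the frozen reference model does, relative to the dispreferred (losing) video. The function name, the scalar error inputs, and the `beta` value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dpo_diffusion_loss(err_w_theta, err_w_ref, err_l_theta, err_l_ref, beta=0.1):
    """Sketch of a Diffusion-DPO-style loss on a video preference pair.

    err_w_theta / err_l_theta: mean-squared noise-prediction errors of the
        fine-tuned model on the preferred (w) and dispreferred (l) videos.
    err_w_ref / err_l_ref: the same errors under the frozen reference model.
    Lower error means the model fits that video better.
    """
    # How much the fine-tuned model improves on the reference for each video.
    advantage_w = err_w_theta - err_w_ref
    advantage_l = err_l_theta - err_l_ref
    # Loss is small when the preferred video is fit better (relative to the
    # reference) than the dispreferred one, i.e. advantage_w < advantage_l.
    return -np.log(sigmoid(-beta * (advantage_w - advantage_l)))
```

When the two videos are fit equally well relative to the reference, the loss sits at log 2; it drops below that only when the model favors the human-preferred video.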
Problem

Research questions and friction points this paper is trying to address.

Lack of DPO strategy in T2V generation
Scarcity of paired video preference data
Insufficient flexibility and poor video quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces DPO strategy in T2V tasks
Constructs human preference datasets for training
Uses First-Frame-Conditioned strategy for video generation
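The last two points, first-frame conditioning and sparse causal attention, can be pictured as an attention mask over frames: each frame attends only to the first frame (the conditioning anchor) and its immediately preceding frame, rather than to all frames. This is a minimal mask sketch in the spirit of sparse-causal attention; it is not the authors' implementation.

```python
import numpy as np


def sparse_causal_mask(num_frames):
    """Boolean attention mask: mask[t, s] is True iff frame t may attend to frame s.

    Each frame attends to frame 0 (first-frame conditioning) and to frame t-1
    (causal temporal link), keeping attention cost linear in the frame count.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for t in range(num_frames):
        mask[t, 0] = True              # first-frame anchor
        mask[t, max(t - 1, 0)] = True  # previous frame (frame 0 attends to itself)
    return mask
```

Compared with full spatio-temporal attention, each row of this mask has at most two True entries, which is what makes the mechanism computationally efficient while still propagating the first frame's content forward.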
👥 Authors
Lifan Jiang — Zhejiang University (AI generation)
Boxi Wu — State Key Laboratory of CAD & CG, Zhejiang University
Jiahui Zhang — State Key Laboratory of CAD & CG, Zhejiang University
Xiaotong Guan — College of Software Technology, Zhejiang University
Shuang Chen — State Key Laboratory of CAD & CG, Zhejiang University