DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Direct Preference Optimization (DPO) for post-training text-to-video diffusion models suffers from two limitations: coarse, whole-clip preference annotations that induce motion bias (annotators tend to prefer low-motion segments, which contain fewer visual artifacts), and sparse preference signals. This work proposes a fine-grained, low-bias preference optimization framework. Methodologically, it introduces: (1) a video-pair construction scheme that denoises corrupted copies of a ground-truth video, yielding temporally aligned pairs that neutralize motion bias; (2) dense preference labels on short temporal segments rather than entire clips; and (3) a fully automated, vision-language model (VLM)-driven segment-level preference annotation pipeline. Experiments show that with only one third of the manually labeled data, the framework substantially improves motion generation quality while matching standard DPO in text-video alignment, visual fidelity, and temporal consistency; with VLM-generated labels, performance approaches that of human annotation.

📝 Abstract
Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
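The pair-construction step described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `make_aligned_pair` and its toy `denoise` stand-in are hypothetical names, and a real system would run partial forward diffusion and the model's reverse process instead.

```python
import numpy as np

def make_aligned_pair(video, noise_level=0.4, seed_a=0, seed_b=1):
    """Toy sketch of aligned pair construction: partially corrupt a
    ground-truth video with noise, then 'denoise' twice with different
    random draws. Both samples inherit the coarse motion structure of
    the original, differing only in local details."""
    rng_a = np.random.default_rng(seed_a)
    rng_b = np.random.default_rng(seed_b)
    # Partial forward diffusion: keep part of the signal, add scaled noise.
    noisy_a = (1 - noise_level) * video + noise_level * rng_a.standard_normal(video.shape)
    noisy_b = (1 - noise_level) * video + noise_level * rng_b.standard_normal(video.shape)
    # Stand-in for the reverse diffusion process (a real diffusion model
    # would denoise here); averaging with the original mimics the fact
    # that the result stays close to the source video.
    denoise = lambda v: (v + video) / 2
    return denoise(noisy_a), denoise(noisy_b)
```

Because both candidates are denoised from corruptions of the same source clip, they remain frame-aligned in time, which is what makes segment-level comparison possible.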
Problem

Research questions and friction points this paper is trying to address.

Addresses bias towards low-motion clips in video preference annotation
Enables fine-grained temporal preference labeling for video diffusion models
Leverages VLMs for automatic segment-level preference annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Denoising ground truth videos for aligned pairs
Labeling preferences on short video segments
Using VLMs for automatic preference annotation
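The segment-level labeling idea above can be connected to a DPO-style objective with a short sketch. All names here (`dense_dpo_loss`, the sign convention for `prefs`) are illustrative assumptions, not the paper's API; the point is only that aligned pairs let each short segment contribute its own preference term instead of one label per clip.

```python
import math

def dense_dpo_loss(logratio_a, logratio_b, prefs, beta=0.1):
    """Hedged sketch of a segment-level DPO objective.
    logratio_a[i] / logratio_b[i]: policy-vs-reference log-probability
    ratio of segment i in each video of an aligned pair.
    prefs[i]: +1 if segment i of video A is preferred, -1 if video B
    is preferred, 0 to skip (tie or unlabeled)."""
    losses = []
    for ra, rb, p in zip(logratio_a, logratio_b, prefs):
        if p == 0:
            continue  # tied/unlabeled segments contribute nothing
        margin = p * (ra - rb)  # preferred-minus-dispreferred ratio gap
        # -log(sigmoid(beta * margin)), written stably via log1p
        losses.append(math.log1p(math.exp(-beta * margin)))
    return sum(losses) / max(len(losses), 1)
```

A larger margin on a preferred segment drives its loss term toward zero, so every labeled segment, not just the whole clip, supplies a learning signal.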