🤖 AI Summary
This work addresses the low data efficiency and evaluation uncertainty inherent in direct preference optimization (DPO) for video diffusion models. To this end, we propose a discriminator-free video DPO framework. Methodologically, we eliminate reliance on human annotations and learned discriminators, instead automatically constructing high-quality win/loss preference pairs by pairing real videos with edited variants of themselves—such as temporally reversed, frame-shuffled, or noise-corrupted clips—enabling scalable, effectively unlimited preference data generation. We theoretically establish that the framework remains effective even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate substantial improvements in training efficiency and significant suppression of temporal artifacts—including flickering and motion discontinuity—yielding superior generation quality and temporal stability over baselines. Key contributions include: (1) an editing-driven pseudo-negative sampling paradigm; (2) a discriminator-free video DPO framework; and (3) theoretical guarantees under distribution shift.
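The editing-driven pseudo-negative sampling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, edit set, and noise scale are assumptions, and only the three edit types named in the summary (temporal reversal, frame shuffling, noise injection) are shown.

```python
import numpy as np

def make_preference_pair(video, rng, edit="reverse", noise_std=0.1):
    """Construct a (win, lose) pair from one real clip.

    video: array of shape (T, H, W, C), floats in [0, 1].
    The real clip is the win case; an edited copy is the lose case.
    Edit names and noise_std are illustrative, not the paper's settings.
    """
    win = video
    if edit == "reverse":      # temporal reversal
        lose = video[::-1].copy()
    elif edit == "shuffle":    # frame shuffling
        lose = video[rng.permutation(len(video))]
    elif edit == "noise":      # noise injection
        lose = np.clip(video + rng.normal(0.0, noise_std, video.shape), 0.0, 1.0)
    else:
        raise ValueError(f"unknown edit: {edit}")
    return win, lose

rng = np.random.default_rng(0)
clip = rng.random((8, 4, 4, 3))          # toy 8-frame clip
win, lose = make_preference_pair(clip, rng, edit="reverse")
```

Because every real clip yields one lose case per edit operation, the preference dataset grows linearly with the number of edits applied, with no extra generation or annotation cost.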
📝 Abstract
Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs. (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts such as flickering or motion incoherence. To address these challenges, we propose a discriminator-free video DPO framework that: (1) uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; and (2) trains video diffusion models to distinguish and avoid the artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.