🤖 AI Summary
This work addresses the low data efficiency and evaluation uncertainty inherent in direct preference optimization (DPO) for video diffusion models. To this end, we propose a discriminator-free video DPO framework. Methodologically, we eliminate reliance on human annotations and learned discriminators, instead automatically constructing high-quality win/loss preference pairs by pairing real videos with edited variants of themselves—such as temporally reversed, frame-shuffled, or noise-corrupted clips—enabling scalable, effectively unlimited preference data generation. We theoretically establish that the framework remains effective even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate substantial improvements in training efficiency and significant suppression of temporal artifacts—including flickering and motion discontinuity—yielding superior generation quality and temporal stability over baselines. Key contributions include: (1) an editing-driven pseudo-negative sampling paradigm; (2) a discriminator-free video DPO framework; and (3) theoretical guarantees under distribution shift.
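The editing-driven pseudo-negative sampling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, edit set, and noise scale are assumptions, and only the three edit types named in the summary (temporal reversal, frame shuffling, noise injection) are shown.

```python
import numpy as np

def make_preference_pair(video, rng, edit="reverse", noise_std=0.1):
    """Construct a (win, lose) pair from one real clip.

    video: array of shape (T, H, W, C), floats in [0, 1].
    The real clip is the win case; an edited copy is the lose case.
    Edit names and noise_std are illustrative, not the paper's settings.
    """
    win = video
    if edit == "reverse":      # temporal reversal
        lose = video[::-1].copy()
    elif edit == "shuffle":    # frame shuffling
        lose = video[rng.permutation(len(video))]
    elif edit == "noise":      # noise injection
        lose = np.clip(video + rng.normal(0.0, noise_std, video.shape), 0.0, 1.0)
    else:
        raise ValueError(f"unknown edit: {edit}")
    return win, lose

rng = np.random.default_rng(0)
clip = rng.random((8, 4, 4, 3))          # toy 8-frame clip
win, lose = make_preference_pair(clip, rng, edit="reverse")
```

Because every real clip yields one lose case per edit operation, the preference dataset grows linearly with the number of edits applied, with no extra generation or annotation cost.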
📝 Abstract
Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs. (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts such as flickering or motion incoherence. To address these challenges, we propose a discriminator-free video DPO framework that: (1) uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; and (2) trains video diffusion models to distinguish and avoid the artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.