🤖 AI Summary
Video generation models still struggle to produce natural, contextually consistent complex motion. To address this, we propose RealDPO, a novel alignment paradigm that brings real-world videos into preference learning: authentic motion videos serve as positive samples and generated videos as negative samples, enabling a motion-fidelity-oriented DPO loss that is optimized iteratively without human annotation. To support this framework, we curate RealAction-5K, a high-quality dataset of human daily activities, and use it to post-train a Transformer-based video generation model. Experiments show significant gains over state-of-the-art methods: FVD decreases by 18.7%, CLIPSIM increases by 12.3%, and the user preference rate rises by 35%. Our approach effectively improves motion realism, text-video alignment, and overall generation quality.
📝 Abstract
Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
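The preference objective described above follows the general shape of Direct Preference Optimization, with the real-world video playing the role of the preferred sample and the model's own output the rejected one. As a minimal sketch (not the paper's tailored loss, whose exact form is not given here), the standard DPO loss over per-sample log-likelihoods looks like this; the function name and the `beta` temperature are illustrative assumptions:

```python
import math

def dpo_loss(logp_real, logp_gen, ref_logp_real, ref_logp_gen, beta=0.1):
    """Standard DPO loss, adapted to the RealDPO framing:
    the 'chosen' sample is a real-world video, the 'rejected'
    sample is a model generation. Arguments are log-likelihoods
    under the trained policy and a frozen reference model."""
    # Implicit reward margin between real and generated samples.
    margin = beta * ((logp_real - ref_logp_real) - (logp_gen - ref_logp_gen))
    # -log(sigmoid(margin)): small when the policy prefers the real video.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-likelihoods the margin is zero and the loss is log 2; as the policy assigns relatively higher likelihood to the real video than to its own generation (measured against the reference model), the loss decreases, which is the self-correction pressure the abstract describes.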