🤖 AI Summary
Video diffusion models often suffer from motion jitter, ghosting, and distorted dynamics because training provides insufficient supervision on temporal consistency. To address this, we propose an adversarial post-training framework that requires no human annotations or reward models. For the first time, we introduce a lightweight optical-flow discriminator, built on the DiT architecture, into motion optimization, combining adversarial training with distribution-matching regularization. Our method significantly improves temporal coherence while preserving visual fidelity and generation efficiency. On the VBench motion benchmark, our 3-step distilled video diffusion model improves the motion score by +7.3% over the 50-step teacher model and by +13.3% over the 3-step DMD baseline. Human preference studies further confirm better motion quality than both the teacher and the DMD baseline. The core contribution is a flow-guided, lightweight adversarial post-training paradigm that enables realistic motion modeling at very low step counts.
📝 Abstract
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics, and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, so a model can reach low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Starting from a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to distinguish real from generated motion and combine it with a distribution-matching regularizer to preserve visual fidelity. In experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN improves the motion score by +7.3% over the 50-step teacher and by +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves the motion score by +7.4% over the teacher and by +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage: https://xavihart.github.io/mogan
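The abstract does not give the loss in closed form, so the following is only a minimal sketch of how an adversarial term on motion might be combined with a distribution-matching regularizer. Everything in it is an assumption made for illustration: frame differences stand in for optical flow, `toy_discriminator` is a hypothetical one-parameter critic rather than a DiT, the logistic (non-saturating) GAN loss is one common choice and not necessarily the one used in MoGAN, and the regularizer is passed in as a placeholder scalar `dm_reg` weighted by a hypothetical coefficient `lam`.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def motion_proxy(video):
    # video: array of shape (T, H, W).
    # ASSUMPTION: frame differences are a crude stand-in for optical flow.
    return video[1:] - video[:-1]

def toy_discriminator(motion, w):
    # HYPOTHETICAL critic: a single logit per clip, computed from the
    # mean absolute motion scaled by one weight w.
    return w * np.mean(np.abs(motion))

def discriminator_loss(real_logit, fake_logit):
    # Logistic GAN loss: raise logits on real motion, lower them on generated motion.
    return softplus(-real_logit) + softplus(fake_logit)

def generator_loss(fake_logit, dm_reg, lam=1.0):
    # Non-saturating adversarial term plus a weighted regularizer.
    # ASSUMPTION: dm_reg is a placeholder scalar for the distribution-matching term.
    return softplus(-fake_logit) + lam * dm_reg

rng = np.random.default_rng(0)
real_clip = np.cumsum(0.1 * rng.normal(size=(8, 4, 4)), axis=0)  # slowly drifting frames
fake_clip = rng.normal(size=(8, 4, 4))                           # independent, jittery frames

w = -1.0  # hypothetical weight: larger motion magnitude gives a lower "real" logit
real_logit = toy_discriminator(motion_proxy(real_clip), w)
fake_logit = toy_discriminator(motion_proxy(fake_clip), w)

print("discriminator loss:", discriminator_loss(real_logit, fake_logit))
print("generator loss:", generator_loss(fake_logit, dm_reg=0.2, lam=0.5))
```

In this sketch, the adversarial term only sees motion, so the regularizer is the only part of the generator objective that constrains per-frame appearance, which mirrors the division of roles described in the abstract.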