MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

📅 2025-11-26

🤖 AI Summary

Video diffusion models often suffer from motion jitter, ghosting, and dynamic distortion due to insufficient temporal consistency supervision. To address this, we propose an adversarial post-training framework that requires no human annotations or reward models. For the first time, we introduce a lightweight optical flow discriminator—built upon the DiT architecture—into motion optimization, jointly leveraging adversarial training and distribution-matching regularization. Our method significantly improves temporal coherence while preserving visual fidelity and generation efficiency. Evaluated on the VBench motion sub-benchmark, our 3-step distilled video diffusion model outperforms the 50-step teacher model by +7.3% and the 3-step DMD baseline by +13.3%. Human preference studies further confirm superior motion quality over existing approaches. The core innovation lies in a flow-guided, lightweight adversarial post-training paradigm, enabling high-realism dynamic modeling under ultra-low-step generation.

📝 Abstract

Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics, and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, so a model can reach low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Starting from a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to distinguish real from generated motion and combine it with a distribution-matching regularizer to preserve visual fidelity. In experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN improves the motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, it improves the motion score by +7.4% over the teacher and +8.8% over DMD while maintaining comparable or better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage: https://xavihart.github.io/mogan.

Problem

Research questions and friction points this paper is trying to address.

Improving motion coherence and realism in video diffusion models

Addressing jitter and ghosting issues without human preference data

Enhancing temporal consistency while preserving visual fidelity
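
The limitation behind these points, that a per-frame MSE objective does not see temporal consistency, can be illustrated with a toy example. The sketch below is not the paper's evaluation code: the tiny 1-D "videos", the offset `c`, and the helper names are all hypothetical. It builds two predictions with the same per-frame MSE, one with a steady offset and one whose offset flips sign every frame, and compares a crude frame-difference error for each.

```python
import numpy as np

# Toy "video": T frames of P pixels whose intensity rises smoothly over time.
T, P = 8, 4
truth = np.linspace(0.0, 1.0, T)[:, None] * np.ones((1, P))

c = 0.1  # hypothetical per-frame error magnitude
signs = np.where(np.arange(T) % 2 == 0, 1.0, -1.0)[:, None]

steady = truth + c          # constant offset: wrong, but temporally smooth
jitter = truth + c * signs  # offset flips sign every frame: visible flicker


def frame_mse(pred, target):
    """Squared error averaged over all frames and pixels."""
    return float(np.mean((pred - target) ** 2))


def temporal_mse(pred, target):
    """Squared error between frame-to-frame differences (a crude motion proxy)."""
    return float(np.mean((np.diff(pred, axis=0) - np.diff(target, axis=0)) ** 2))


print("frame MSE   :", frame_mse(steady, truth), frame_mse(jitter, truth))
print("temporal MSE:", temporal_mse(steady, truth), temporal_mse(jitter, truth))
```

Both predictions have a frame MSE of about `c**2`, yet the flickering one has a temporal error of about `(2 * c)**2` while the steady one has essentially none. A frame-level loss alone cannot tell them apart, which is the gap a motion-specific signal is meant to close.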

Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training with an optical-flow discriminator for motion realism

Distribution-matching regularizer to preserve visual fidelity

Built on a 3-step distilled video diffusion model
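
One way to picture how the first two items fit together is a generator objective that adds an adversarial term to a distribution-matching term. The following is a minimal sketch under stated assumptions, not the paper's implementation: the non-saturating logistic GAN loss, the weight `lambda_adv`, and all function names are hypothetical. The discriminator logits are assumed to come from a network that scores optical flow of real and generated clips, and the distribution-matching term is assumed to be a precomputed scalar.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def discriminator_loss(real_logits, fake_logits):
    """Logistic GAN loss: push logits up on real flow and down on generated flow."""
    return float(np.mean(softplus(-real_logits)) + np.mean(softplus(fake_logits)))


def generator_loss(fake_logits, dm_reg, lambda_adv=1.0):
    """Assumed combination: distribution-matching term plus weighted adversarial term.

    fake_logits: discriminator scores on the flow of generated clips (hypothetical input).
    dm_reg:      scalar value of the distribution-matching regularizer (computed elsewhere).
    lambda_adv:  hypothetical weight on the adversarial term.
    """
    adversarial = float(np.mean(softplus(-fake_logits)))  # non-saturating form
    return dm_reg + lambda_adv * adversarial


# Toy usage with made-up logits.
real_logits = np.array([2.0, 1.5, 0.5])
fake_logits = np.array([-1.0, -2.0, 0.0])
print("D loss:", discriminator_loss(real_logits, fake_logits))
print("G loss:", generator_loss(fake_logits, dm_reg=0.3, lambda_adv=0.5))
```

In this sketch the adversarial term falls as the discriminator rates generated motion as more realistic, while the regularizer term is what would keep the generator close to the distilled model's output distribution. How the two are actually balanced is not stated in the abstract.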
