SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video diffusion distillation methods suffer from high training costs and overly conservative motion dynamics under few-step inference. This work proposes SGMD, which takes a fake-score perspective: it optimizes the fake score directly toward the teacher and uses a teacher stop-gradient Fisher objective for stable distribution matching. To improve both training efficiency and motion dynamics, SGMD adds a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Compared to DMD2, SGMD trains approximately 3× faster, and its 4-step distilled model shows substantially stronger motion dynamics while preserving temporal consistency. A human study finds SGMD preferred in motion quality and overall preference, with visual quality and text alignment remaining comparable.

📝 Abstract

Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges. First, the fake score must track a continuously evolving generator, which makes training costly when frequent updates are required. Second, reverse-KL-style matching can be mode-seeking and too conservative to preserve strong motion dynamics. To address these issues, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately $3\times$ training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.
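For orientation, the two quantities the abstract contrasts can be written in their standard textbook form. This is a sketch of the general definitions only, not the paper's exact objective. The symbols $p_\theta$ (generator distribution), $q$ (teacher distribution), $G_\theta$ (generator), $z$ (input noise), and the scores $s_{p_\theta}$, $s_q$ are notation assumed here. The noise-level conditioning used in practice and the paper's specific stop-gradient placement, NR potential, and RC potential are not reproduced.

```latex
% Scores (assumed notation): gradients of the log-densities with respect to the sample x.
s_{p_\theta}(x) = \nabla_x \log p_\theta(x), \qquad s_q(x) = \nabla_x \log q(x)

% Reverse-KL gradient used in DMD-style matching, with x = G_\theta(z).
% In practice s_{p_\theta} is approximated by a learned fake score.
\nabla_\theta \, \mathrm{KL}\!\left(p_\theta \,\|\, q\right)
  = \mathbb{E}_{z}\!\left[ \big( s_{p_\theta}(x) - s_q(x) \big)^{\!\top}
      \frac{\partial G_\theta(z)}{\partial \theta} \right]

% Fisher divergence between the same two distributions
% (some conventions include an extra factor of 1/2).
D_{\mathrm{F}}\!\left(p_\theta \,\|\, q\right)
  = \mathbb{E}_{x \sim p_\theta}\!\left[ \left\| s_{p_\theta}(x) - s_q(x) \right\|_2^2 \right]
```

Both quantities depend on the same score difference $s_{p_\theta} - s_q$. The reverse-KL gradient uses it linearly, while the Fisher divergence penalizes its squared norm, so both vanish when the two scores agree.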

Problem

Research questions and friction points this paper is trying to address.

video diffusion distillation

few-step inference

score matching

motion dynamics

distribution matching

Innovation

Methods, ideas, or system contributions that make the work stand out.

Score Gradient Matching

Video Diffusion Distillation

Few-Step Generation