MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Portrait Few-Step Synthesis

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the distribution shift and training instability caused by few-step sampling in large-scale portrait video synthesis, this paper proposes a Weak-to-Strong Video Distillation (W2SVD) framework. W2SVD combines LoRA-based efficient fine-tuning, fake-real DiT parameter alignment, and a weak-weight-guided distribution matching mechanism that overcomes the inaccuracy of the conventional KL divergence approximation under extremely few sampling steps (e.g., 7 steps). Evaluated on HunyuanVideo, W2SVD achieves superior performance with only 7 sampling steps (1/4 of the standard 28-step schedule) and outperforms state-of-the-art methods, including LCM and DMD, on FID, FVD, and VBench. To our knowledge, W2SVD is the first approach to enable high-fidelity, stable, and computationally efficient step distillation for large-scale video diffusion models, establishing a new paradigm for practical video generation.

📝 Abstract
Fine-tuning open-source large-scale VDMs for the portrait video synthesis task can yield significant improvements across multiple dimensions, such as visual quality and natural facial motion dynamics. Despite these advancements, how to achieve step distillation and reduce the substantial computational overhead of large-scale VDMs remains unexplored. To fill this gap, this paper proposes Weak-to-Strong Video Distillation (W2SVD) to mitigate both the insufficient training memory and the training collapse observed in vanilla DMD during training. Specifically, we first leverage LoRA to fine-tune the fake diffusion transformer (DiT), addressing the out-of-memory issue. We then employ W2S distribution matching to adjust the real DiT's parameters, subtly shifting them toward the fake DiT's parameters. This adjustment uses the weak weight of the low-rank branch, effectively alleviating the conundrum in which videos synthesized by the few-step generator deviate from the real data distribution, which leads to inaccuracies in the KL divergence approximation. Additionally, we minimize the distance between the fake data distribution and the ground-truth distribution to further enhance the visual quality of the synthesized videos. As experimentally demonstrated on HunyuanVideo, W2SVD surpasses standard Euler sampling, LCM, DMD, and even the 28-step standard sampling schedule on FID/FVD and VBench while using only 1/4 of the sampling steps. The project page is at https://w2svd.github.io/W2SVD/.
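The abstract's core mechanism, shifting the real DiT's parameters toward the fake DiT's using the weak weight of the low-rank branch, can be sketched as a small scaled merge of the fake DiT's LoRA update into the frozen real weights. This is a minimal illustrative sketch: the function name `w2s_shift` and the `weak_alpha` knob are assumptions, and the paper's actual shift schedule is not specified here.

```python
import numpy as np

def w2s_shift(real_weight, lora_A, lora_B, weak_alpha=0.1):
    """Shift a real-DiT weight matrix toward the fake DiT.

    The fake DiT's deviation from the real DiT is captured by its
    low-rank (LoRA) update B @ A; scaling it by a small "weak" weight
    nudges the real DiT toward the fake one without replacing it,
    keeping the KL divergence approximation closer to the few-step
    generator's actual output distribution.
    """
    lora_delta = lora_B @ lora_A              # (d_out, d_in) low-rank update
    return real_weight + weak_alpha * lora_delta

# toy example: a 4x4 weight with a rank-2 LoRA branch
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))               # real DiT weight (frozen)
A = rng.standard_normal((2, 4))               # LoRA down-projection, rank 2
B = rng.standard_normal((4, 2))               # LoRA up-projection
W_shifted = w2s_shift(W, A, B, weak_alpha=0.1)
```

With `weak_alpha=0` the real DiT is untouched; larger values move it further toward the fake DiT, so the weak weight controls how "strong" the alignment is.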
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in large-scale video synthesis models
Addresses training memory insufficiency and collapse in diffusion models
Enhances visual quality and accuracy in few-step video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA fine-tunes the fake DiT to avoid out-of-memory failures
W2S distribution matching subtly shifts the real DiT's parameters toward the fake DiT's
Minimizing the fake-to-ground-truth distribution distance further improves visual quality
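The memory saving behind the first innovation comes from LoRA's structure: the full weight is frozen and only a small low-rank branch is trained, so optimizer state is kept only for the low-rank factors. Below is a generic illustrative sketch of such a layer in numpy; the class name, zero-initialized up-projection, and `alpha / rank` scaling follow common LoRA practice and are not taken from the paper's code.

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer plus a trainable low-rank branch.

    Only A and B would receive gradients during fine-tuning, so the
    optimizer never stores state for the full (d_out, d_in) weight,
    which is the memory saving LoRA provides for large DiTs.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight                                # frozen, (d_out, d_in)
        d_out, d_in = weight.shape
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # frozen base path + scaled low-rank path
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)
```

Because `B` starts at zero, the layer initially reproduces the frozen base model exactly; fine-tuning then only has to learn the low-rank correction.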