🤖 AI Summary
Diffusion-based video generation suffers from high computational cost due to iterative denoising. While consistency models enable faster sampling, direct adaptation often introduces temporal distortions and texture degradation. To address this, we propose a dual-expert collaborative distillation framework: a novel semantic expert dedicated to modeling global temporal consistency, and a detail expert focused on preserving intra-frame texture fidelity. These experts are jointly optimized via a composite objective comprising temporal consistency loss, adversarial loss, and feature-matching loss—effectively mitigating gradient conflicts across timesteps. Our method achieves high-fidelity video synthesis with only ≤4 sampling steps, significantly improving both temporal coherence and fine-grained texture quality. Extensive evaluations demonstrate state-of-the-art visual quality across multiple benchmarks, outperforming existing diffusion and consistency-modeling approaches in both perceptual metrics and human evaluation.
📝 Abstract
Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models; however, applying them directly to video diffusion models often causes severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflict in the learning dynamics during distillation: the optimization gradients and loss contributions differ significantly across timesteps. This discrepancy prevents the distilled student model from reaching an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient **Dual-Expert Consistency Model (DCM)**, in which a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce a Temporal Coherence Loss to improve motion consistency for the semantic expert, and apply GAN and Feature Matching losses to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at https://github.com/Vchitect/DCM.
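To make the composite objective concrete, here is a minimal NumPy sketch of how the three loss terms described above might be combined. All function names, the exact loss definitions (e.g. penalizing frame-to-frame motion differences for temporal coherence), and the weights `w_tc`, `w_fm`, `w_adv` are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def temporal_coherence_loss(student_frames, teacher_frames):
    # Assumed form: penalize mismatch in frame-to-frame motion,
    # i.e. the difference of temporal gradients along the time axis.
    s_motion = np.diff(student_frames, axis=0)
    t_motion = np.diff(teacher_frames, axis=0)
    return float(np.mean((s_motion - t_motion) ** 2))

def feature_matching_loss(student_feats, teacher_feats):
    # Assumed form: mean L1 distance between intermediate
    # discriminator features, averaged over layers.
    return float(np.mean([np.mean(np.abs(s - t))
                          for s, t in zip(student_feats, teacher_feats)]))

def composite_loss(student_frames, teacher_frames,
                   student_feats, teacher_feats, adv_loss,
                   w_tc=1.0, w_fm=10.0, w_adv=0.1):
    # Weighted sum of the three terms; adv_loss stands in for the
    # GAN generator loss computed elsewhere in the training loop.
    return (w_tc * temporal_coherence_loss(student_frames, teacher_frames)
            + w_fm * feature_matching_loss(student_feats, teacher_feats)
            + w_adv * adv_loss)
```

In practice the paper routes the temporal term to the semantic expert and the GAN/feature-matching terms to the detail expert; this sketch only shows the shape of a combined objective.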