Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Continuous-time consistency distillation (sCM) faces scalability challenges on large-scale image/video diffusion models due to the prohibitive overhead of Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. Method: This work pioneers the application of continuous-time consistency distillation to text-to-image/video generation at up to 14B parameters, introducing the score-regularized continuous-time consistency model (rCM). rCM employs score distillation as a long-skip regularizer, complementing sCM's mode-covering forward-divergence objective with a mode-seeking reverse divergence to mitigate error accumulation and fine-detail quality loss. A parallelism-compatible FlashAttention-2 JVP kernel makes training at this scale feasible. Results: On 5-second video generation, rCM produces high-fidelity samples in just 1–4 steps, accelerating inference by 15–50×, matching or surpassing DMD2 on quality metrics while significantly exceeding it in sample diversity.
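The combined objective described above can be sketched as a toy loss: a short-horizon consistency term (student predictions at adjacent noise levels should agree) plus a long-skip score-distillation regularizer (student output pulled toward the teacher's denoised estimate). The function names, tensor shapes, and the weighting `lam` below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def consistency_loss(f_t, f_t_next):
    # Self-consistency (mode-covering): predictions at adjacent
    # noise levels along the ODE trajectory should match.
    return np.mean((f_t - f_t_next) ** 2)

def score_distill_loss(student_x0, teacher_x0):
    # Long-skip regularizer (mode-seeking): pull the student's
    # one-step prediction toward the teacher's denoised estimate.
    return np.mean((student_x0 - teacher_x0) ** 2)

def rcm_loss(f_t, f_t_next, student_x0, teacher_x0, lam=0.5):
    # Hypothetical weighting between the two terms.
    return consistency_loss(f_t, f_t_next) + lam * score_distill_loss(
        student_x0, teacher_x0
    )
```

The point of the second term is regularization over long horizons: the consistency term alone only constrains adjacent timesteps, so per-step errors can accumulate across the trajectory.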

📝 Abstract
This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim 4$ steps, accelerating diffusion sampling by $15\times\sim 50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
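The infrastructure bottleneck the abstract names is the JVP: the continuous-time objective differentiates the student network along the sampling trajectory, which requires the directional derivative $(\partial f/\partial x)\,v$ for a tangent $v$ without materializing the full Jacobian. A minimal sketch with a toy one-layer network (the paper's FlashAttention-2 kernel plays this role inside attention layers; everything below is an illustrative stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))  # toy stand-in for network weights

def f(x):
    # Toy "network": a single tanh layer.
    return np.tanh(W @ x)

def jvp(x, v):
    # Analytic JVP of tanh(Wx) in direction v:
    # diag(1 - tanh^2(Wx)) @ (W @ v) -- Jacobian never formed.
    return (1.0 - np.tanh(W @ x) ** 2) * (W @ v)

x = rng.normal(size=3)
v = rng.normal(size=3)

# Central finite difference as an independent check on the JVP.
eps = 1e-6
fd = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
```

In practice frameworks provide this via forward-mode autodiff (e.g. `torch.func.jvp` or `jax.jvp`); the paper's contribution is making that forward pass efficient and parallelism-compatible inside FlashAttention-2 at 10B+ scale.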
Problem

Research questions and friction points this paper is trying to address.

Scaling continuous-time consistency distillation to billion-parameter image/video models
Addressing quality limitations in fine-detail generation for consistency models
Enabling few-step high-fidelity generation while maintaining output diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallelism-compatible FlashAttention-2 JVP kernel enables sCM training on 10B+ parameter models
Score-regularized consistency model (rCM) integrates score distillation as a long-skip regularizer
Distilled models generate high-fidelity samples in 1–4 steps, a 15–50× sampling speedup