🤖 AI Summary
This work addresses the inefficiency of multi-step sampling in large-scale video diffusion models, which hinders real-time interactive applications. To overcome this limitation, the authors propose Transition Matching Distillation (TMD), a framework that efficiently distills a full video diffusion model into a few-step generator. TMD matches the multi-step denoising trajectory with a few-step probability transition process, modeling each transition as a lightweight conditional flow, and decomposes the network into a main backbone and a flow head so that each outer transition step amortizes one backbone pass over multiple inner flow updates, yielding a flexible trade-off between generation speed and quality. Experimental results on the Wan2.1 1.3B and 14B models demonstrate that TMD significantly outperforms existing distillation methods under comparable inference costs, delivering superior visual fidelity and prompt adherence.
📝 Abstract
Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence. Project page: https://research.nvidia.com/labs/genair/tmd
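The sampling pattern described above, where each of a few outer transition steps runs the heavy backbone once and then rolls out the lightweight flow head for several inner updates, can be sketched as follows. This is a minimal toy illustration: the networks, latent dimension, step counts, and Euler update rule are all assumptions for exposition, not the paper's actual Wan2.1-based architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two components (illustrative shapes, not the real model).
D = 8                                       # latent dimension
W_backbone = rng.standard_normal((D, D)) * 0.1
W_head = rng.standard_normal((2 * D, D)) * 0.1

def backbone(x):
    """Heavy network: run once per outer transition step to extract features."""
    return np.tanh(x @ W_backbone)

def flow_head(x, feats):
    """Light network: predicts a velocity conditioned on backbone features."""
    return np.concatenate([x, feats], axis=-1) @ W_head

def tmd_style_sample(x, outer_steps=4, inner_steps=8):
    """Few outer transitions; each rolls out the flow head with Euler updates."""
    for _ in range(outer_steps):
        feats = backbone(x)              # one expensive call, amortized over...
        dt = 1.0 / inner_steps
        for _ in range(inner_steps):     # ...many cheap inner flow updates
            x = x + dt * flow_head(x, feats)
    return x

noise = rng.standard_normal((1, D))
sample = tmd_style_sample(noise)
print(sample.shape)
```

The speed/quality trade-off the abstract mentions corresponds to tuning `outer_steps` (backbone passes) against `inner_steps` (flow-head rollout length) in this sketch.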