Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models often suffer from temporal drift, weakened motion dynamics, and excessive smoothing when operating at extremely low inference budgets (2–4 NFEs). To address this, this work proposes Self-Consistent Distribution Matching Distillation (SC-DMD), which introduces endpoint-consistency regularization into distribution matching distillation for the first time. By treating the KV cache as a quality-parameterized condition and employing cache-aware multi-step rollout training, SC-DMD aligns conditional features throughout the denoising trajectory. This enables joint optimization of the KV cache and the distilled distribution, significantly enhancing generation quality in low-NFE regimes. The method consistently yields sharper, more dynamic video outputs across diverse backbone architectures, including Wan 2.1 and Self Forcing, and is compatible with various KV-caching mechanisms, facilitating efficient real-time deployment.
📝 Abstract
Distilling video generation models to extremely low inference budgets (e.g., 2–4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed **Salt**, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.
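The abstract's central idea, penalizing the mismatch between one large denoising step and the composition of two smaller steps over the same interval, can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's implementation: the "denoiser" is a one-parameter linear model, and the step rule follows a simple straight-line (flow-like) noising path.

```python
import numpy as np

def denoise_step(x, t_from, t_to, theta):
    """Toy linear 'student denoiser' (illustrative, not SC-DMD itself):
    predicts the clean sample as theta * x, then re-noises it back to
    noise level t_to along a straight interpolation path."""
    x0_pred = theta * x
    if t_from <= 0:
        return x0_pred
    return x0_pred + (t_to / t_from) * (x - x0_pred)

def endpoint_consistency_loss(x_t, t, s, r, theta):
    """Endpoint-consistency regularizer in the spirit of SC-DMD:
    the direct jump t -> r should land where the composed rollout
    t -> s -> r lands; their squared mismatch is the penalty."""
    direct = denoise_step(x_t, t, r, theta)
    composed = denoise_step(denoise_step(x_t, t, s, theta), s, r, theta)
    return float(np.mean((direct - composed) ** 2))
```

With this toy model, a perfectly self-consistent denoiser (`theta = 1.0`, i.e., the identity prediction of the clean sample) incurs zero penalty, while any other `theta` makes the two-step rollout drift away from the one-step endpoint, which is exactly the failure mode the regularizer is meant to suppress.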
Problem

Research questions and friction points this paper is trying to address.

video generation
model distillation
low-NFE inference
temporal consistency
distribution matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Consistent Distillation
Distribution Matching
Cache-Aware Training
Low-NFE Video Generation
KV Cache Conditioning