Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models often suffer from temporal drift, weakened motion dynamics, and excessive smoothing when operating at extremely low inference budgets (2–4 NFEs). To address this, this work proposes Self-Consistent Distribution Matching Distillation (SC-DMD), which introduces endpoint-consistency regularization into distribution matching distillation for the first time. By treating the KV cache as a quality-parameterized condition and employing cache-aware multi-step rollout training, SC-DMD aligns conditional features throughout the denoising trajectory. This enables joint optimization of the KV cache and the distilled distribution, significantly enhancing generation quality in low-NFE regimes. The method consistently yields sharper, more dynamic video outputs across diverse backbone architectures, including Wan 2.1 and Self Forcing, and is compatible with various KV-caching mechanisms, facilitating efficient real-time deployment.
📝 Abstract
Distilling video generation models to extremely low inference budgets (e.g., 2–4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed **Salt**, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.
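The abstract's central idea, penalizing the mismatch between one large denoising step and the composition of two smaller steps over the same interval, can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's implementation: the "denoiser" is a one-parameter linear model, and the step rule follows a simple straight-line (flow-like) noising path.

```python
import numpy as np

def denoise_step(x, t_from, t_to, theta):
    """Toy linear 'student denoiser' (illustrative, not SC-DMD itself):
    predicts the clean sample as theta * x, then re-noises it back to
    noise level t_to along a straight interpolation path."""
    x0_pred = theta * x
    if t_from <= 0:
        return x0_pred
    return x0_pred + (t_to / t_from) * (x - x0_pred)

def endpoint_consistency_loss(x_t, t, s, r, theta):
    """Endpoint-consistency regularizer in the spirit of SC-DMD:
    the direct jump t -> r should land where the composed rollout
    t -> s -> r lands; their squared mismatch is the penalty."""
    direct = denoise_step(x_t, t, r, theta)
    composed = denoise_step(denoise_step(x_t, t, s, theta), s, r, theta)
    return float(np.mean((direct - composed) ** 2))
```

With this toy model, a perfectly self-consistent denoiser (`theta = 1.0`, i.e., the identity prediction of the clean sample) incurs zero penalty, while any other `theta` makes the two-step rollout drift away from the one-step endpoint, which is exactly the failure mode the regularizer is meant to suppress.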
Problem

Research questions and friction points this paper is trying to address.

video generation
model distillation
low-NFE inference
temporal consistency
distribution matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Consistent Distillation
Distribution Matching
Cache-Aware Training
Low-NFE Video Generation
KV Cache Conditioning