Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing distillation methods for video diffusion models directly adopt image-based distillation strategies, often leading to issues such as oversaturation, temporal inconsistency, and mode collapse. This work proposes the first distillation framework specifically designed for video diffusion models, featuring an adaptive regression loss that dynamically modulates spatial supervision strength and a temporal regularization loss that suppresses inter-frame inconsistencies. Coupled with a frame interpolation strategy at inference time, the method enables efficient, high-quality video generation with extremely few sampling steps. Evaluated on the VBench and VBench2 benchmarks, the proposed approach significantly outperforms existing distillation techniques, achieving superior perceptual fidelity, natural motion dynamics, and stable few-step synthesis.
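
As a rough illustration of the two training losses described in the summary, the sketch below gives one plausible PyTorch formulation. The exponential weighting schedule, the `alpha` hyperparameter, the `(B, T, C, H, W)` tensor layout, and the teacher-matching form of the temporal term are all assumptions; the paper's exact definitions are not reproduced here.

```python
import torch
import torch.nn.functional as F

def adaptive_regression_loss(student, teacher, alpha=1.0):
    """Regression loss whose spatial supervision strength is modulated by
    the current student-teacher gap (assumed exponential schedule)."""
    with torch.no_grad():
        # Per-sample distribution shift, averaged over frames and pixels.
        shift = (student - teacher).abs().mean(dim=(1, 2, 3, 4), keepdim=True)
        # Down-weight supervision where the shift is already large, so
        # over-strong gradients do not drive frames into oversaturation.
        weight = torch.exp(-alpha * shift)
    return (weight * (student - teacher) ** 2).mean()

def temporal_regularization_loss(student, teacher):
    """Match the student's inter-frame dynamics to the teacher's so motion
    neither freezes (temporal collapse) nor flickers (assumed form)."""
    s_motion = student[:, 1:] - student[:, :-1]  # (B, T-1, C, H, W)
    t_motion = teacher[:, 1:] - teacher[:, :-1]
    return F.mse_loss(s_motion, t_motion)

# Example usage with dummy 8-frame clips; the 0.1 weight is illustrative.
student = torch.randn(2, 8, 3, 32, 32, requires_grad=True)
teacher = torch.randn(2, 8, 3, 32, 32)
loss = adaptive_regression_loss(student, teacher) \
    + 0.1 * temporal_regularization_loss(student, teacher)
loss.backward()
```

One intuition consistent with the summary: when the student already deviates strongly from the teacher, a full-strength regression gradient tends to overshoot, which is a common route to oversaturated few-step outputs; the adaptive weight relaxes supervision exactly in that regime.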

📝 Abstract
Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.
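
To make the inference-time frame interpolation strategy concrete, the following is a minimal sketch assuming simple linear blending between adjacent generated frames. The `(B, T, C, H, W)` clip layout, the `factor` parameter, and the blending rule are illustrative assumptions; a practical system would more likely use a learned or flow-based interpolator.

```python
import torch

def interpolate_frames(clip: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Temporally upsample a few-step-generated clip of shape
    (B, T, C, H, W) by linearly blending neighboring frames, so the
    student only has to synthesize every `factor`-th frame."""
    b, t, c, h, w = clip.shape
    frames = [clip[:, 0]]
    for i in range(t - 1):
        for j in range(1, factor):
            w_j = j / factor  # blend weight for the intermediate frame
            frames.append((1 - w_j) * clip[:, i] + w_j * clip[:, i + 1])
        frames.append(clip[:, i + 1])
    return torch.stack(frames, dim=1)  # (B, (T-1) * factor + 1, C, H, W)

# Example: a 9-frame clip upsampled to 17 frames.
clip = torch.randn(1, 9, 3, 64, 64)
print(interpolate_frames(clip, factor=2).shape)  # torch.Size([1, 17, 3, 64, 64])
```

The point of the strategy is that the distilled student only needs to synthesize a temporally sparse clip, so per-clip denoising cost drops while the perceived frame rate is preserved.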
Problem

Research questions and friction points this paper is trying to address.

video distillation
oversaturation
temporal collapse
few-step generation
temporal inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive regression loss
temporal regularization
frame interpolation
video diffusion distillation
few-step generation