Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

📅 2025-07-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Video generation diffusion models suffer from slow inference and high computational overhead due to iterative denoising, severely hindering practical deployment. To address this, we propose EasyCache, a training-free, architecture-agnostic, runtime-adaptive caching framework. Its core innovation is a lightweight online mechanism that caches feature vectors and dynamically reuses them, automatically identifying and exploiting redundant computations from real-time statistics alone, without offline analysis or hyperparameter tuning. EasyCache is compatible with mainstream architectures including OpenSora, Wan2.1, and HunyuanVideo. At comparable generation quality, it achieves a 2.1–3.3× inference speedup and up to 36% higher PSNR than prior acceleration methods, significantly outperforming them.

Technology Category

Application Category

πŸ“ Abstract
Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by up to 2.1–3.3× compared to the original baselines while maintaining high visual fidelity, with up to a 36% PSNR improvement over the previous SOTA method. This makes EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.
Problem

Research questions and friction points this paper is trying to address.

Accelerate video diffusion models without training
Reduce redundant computations in denoising process
Maintain high visual fidelity during inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free video diffusion acceleration framework
Runtime-adaptive caching mechanism
Reuses computed transformation vectors dynamically
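The reuse rule described above can be illustrated with a toy sketch: recompute the expensive transformation only when an online statistic (here, the relative drift of the input since the last recomputation) crosses a threshold, and otherwise reuse the cached vector. All names, the drift statistic, the threshold value, and the update rule below are hypothetical illustrations, not the actual EasyCache implementation.

```python
import numpy as np

def denoise_with_adaptive_cache(x, steps, transform, rel_threshold=0.25):
    """Toy sketch of runtime-adaptive caching: the expensive `transform`
    is recomputed only when the input has drifted enough since the last
    recomputation; otherwise the cached output vector is reused."""
    cached_out, cached_in = None, None
    recomputed = 0
    for t in range(steps):
        if cached_in is not None:
            # Online statistic: relative input drift since the last recompute.
            drift = np.linalg.norm(x - cached_in) / (np.linalg.norm(cached_in) + 1e-8)
        if cached_in is None or drift >= rel_threshold:
            cached_out = transform(x, t)  # stand-in for the expensive model call
            cached_in = x.copy()
            recomputed += 1
        # Toy update rule; stands in for one denoising/solver step.
        x = x + 0.1 * cached_out
    return x, recomputed
```

Because the threshold is checked against statistics gathered at inference time, no offline profiling pass is needed; a smaller threshold recomputes more often (higher fidelity, less speedup), a larger one reuses more aggressively.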
Authors

Xin Zhou (Huazhong University of Science and Technology)
Dingkang Liang (Huazhong University of Science and Technology) · Embodied AI, World Model, Autonomous Driving, Crowd Counting
Kaijin Chen (Huazhong University of Science and Technology)
Tianrui Feng (Huazhong University of Science and Technology)
Xiwu Chen (MEGVII Technology)
Hongkai Lin (Huazhong University of Science and Technology)
Yikang Ding (Tsinghua University) · 3D Vision, Generative Model
Feiyang Tan (MEGVII Technology)
Hengshuang Zhao (The University of Hong Kong) · Computer Vision, Machine Learning, Artificial Intelligence
Xiang Bai (Huazhong University of Science and Technology) · Computer Vision, OCR