🤖 AI Summary
To address the GPU memory explosion caused by caching acceleration in diffusion-based video generation, this paper proposes a stage-wise memory optimization strategy: asynchronous cache swapping during encoding, feature chunking during denoising, and latent-space slicing during decoding, coordinated by a unified cache management mechanism. The method requires no model fine-tuning or retraining. It achieves up to a 62% reduction in peak GPU memory consumption over baseline approaches without increasing inference latency, while keeping quality degradation controlled (FVD increase below 5%). Its core innovation is the stage-specific design of the memory optimization techniques, each tailored to one phase of the inference pipeline, which yields a favorable trade-off between memory-management overhead and acceleration gains. The implementation is publicly available.
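To make the asynchronous cache swapping idea concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation). It simulates device and host storage with plain dictionaries and uses a background thread to overlap the offload with other work, so a cached feature can be moved off the "device" while computation continues and fetched back just before it is needed; the class and method names are illustrative assumptions.

```python
import threading
import numpy as np

class AsyncCacheSwapper:
    """Illustrative sketch: hold only the cache entry needed for the
    current step in 'device' storage; move other entries to 'host'
    storage in a background thread so the transfer overlaps compute.
    In a real system, 'device'/'host' would be GPU/CPU memory and the
    copies would use an asynchronous transfer stream."""

    def __init__(self):
        self.host = {}      # simulated CPU-side storage
        self.device = {}    # simulated GPU-side storage
        self._worker = None

    def put(self, step, feature):
        # cache a feature produced at a given denoising step
        self.device[step] = feature

    def offload(self, step):
        # start moving a cached feature off the device without blocking
        def _move():
            self.host[step] = self.device.pop(step)
        self._worker = threading.Thread(target=_move)
        self._worker.start()

    def prefetch(self, step):
        # ensure any in-flight offload has finished, then bring the
        # requested entry back to device storage before it is used
        if self._worker is not None:
            self._worker.join()
            self._worker = None
        if step in self.host:
            self.device[step] = self.host.pop(step)
        return self.device[step]
```

A typical round trip: `put` a feature, `offload` it while the next step computes, then `prefetch` it when the reuse point arrives; the entry comes back bit-identical, only its residence changed.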
📝 Abstract
Training-free acceleration has emerged as an active research area in video generation based on diffusion models. The redundancy of latents in diffusion model inference provides a natural entry point for acceleration. In this paper, we decompose the inference process into the encoding, denoising, and decoding stages, and observe that cache-based acceleration methods often lead to substantial memory surges in the latter two stages. To address this problem, we analyze the characteristics of inference across the different stages and propose stage-specific strategies for reducing memory consumption: 1) asynchronous cache swapping; 2) feature chunking; 3) slicing latents for decoding. At the same time, we ensure that the time overhead introduced by these three strategies remains lower than the acceleration gains themselves. Compared with the baseline, our approach achieves faster inference and lower memory usage while keeping quality degradation within an acceptable range. The code is available at https://github.com/NKUShaw/LightCache .
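The third strategy, slicing latents for decoding, can be sketched generically: instead of decoding the full latent tensor at once, decode it slice by slice along the frame axis and concatenate the results, so peak memory scales with the slice size rather than the full video length. This is a simplified illustration under an assumption the paper does not necessarily make, namely that the decoder treats slices independently (a decoder with a temporal receptive field would need overlapping slices); the function name and signature are invented for the example.

```python
import numpy as np

def sliced_decode(decode_fn, latents, slice_size=2):
    """Decode latents one slice at a time along the leading (frame)
    axis to bound peak memory, then stitch the outputs back together.

    decode_fn  -- decoder applied to a slice of latents
    latents    -- array of shape (num_frames, ...) of latent codes
    slice_size -- number of frames decoded per call (memory knob)
    """
    outputs = []
    for start in range(0, latents.shape[0], slice_size):
        chunk = latents[start:start + slice_size]
        outputs.append(decode_fn(chunk))   # only one slice resident at a time
    return np.concatenate(outputs, axis=0)
```

With a slice-independent decoder, the sliced result matches the all-at-once result exactly; the trade-off is a few extra decoder invocations in exchange for a much smaller activation footprint.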