🤖 AI Summary
Existing autoregressive video diffusion models suffer from severe computational redundancy in long-video generation: overlapping conditional frames between adjacent segments undergo repeated denoising, causing quadratic growth in computational cost with context length. This work introduces a causal generation framework with cache sharing, enabling the first unidirectional, precomputable caching and reuse of conditional-frame features. Precomputing and reusing these cached features eliminates the redundant computation behind the quadratic cost, while sharing a single cache across all denoising steps keeps the storage overhead low. The core innovations, temporal feature disentanglement and cache consistency design, are rigorously enforced under causal modeling constraints. Experiments demonstrate substantial acceleration (2.1–3.8× faster inference) alongside significant reductions in GPU memory consumption and FLOPs. The method achieves state-of-the-art quantitative performance and visual quality on standard benchmarks including UCF-101 and BAIR.
📝 Abstract
With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: the model must re-compute all the conditional frames that overlap between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with quadratic complexity w.r.t. the number of autoregression steps). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid a huge cache storage cost. Extensive experiments demonstrate that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available: https://github.com/Dawn-LX/CausalCache-VDM
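To make the complexity argument concrete, the sketch below contrasts the two cost models described in the abstract: a naive autoregressive VDM that re-encodes every overlapping conditional frame at every denoising step, versus causal generation where conditional-frame features are encoded once (unidirectionally, so they never need updating) and the resulting cache is shared across all denoising steps. This is a hypothetical cost-counting illustration, not the authors' implementation; `encode`, the frame values, and all function names are placeholders standing in for the actual causal feature computation.

```python
def encode(frame, past_features):
    # Stand-in for one causal (unidirectional) feature computation:
    # a frame's feature depends only on itself and EARLIER frames,
    # so once computed it never changes and can be cached.
    return frame + sum(past_features)

def naive_autoregressive(frames_per_clip, num_ar_steps, denoise_steps):
    """Baseline: at every autoregression step, every denoising step
    re-encodes all overlapping conditional frames, so total cost grows
    quadratically with the number of autoregression steps."""
    cost = 0
    context = []  # previously generated frames used as conditions
    for _ in range(num_ar_steps):
        for _ in range(denoise_steps):
            feats = []
            for f in context:  # redundant re-computation of conditions
                feats.append(encode(f, feats))
                cost += 1
            cost += frames_per_clip  # denoise the new clip's frames
        context += [0.0] * frames_per_clip  # extend the context
    return cost

def cached_autoregressive(frames_per_clip, num_ar_steps, denoise_steps):
    """Causal generation with cache sharing: conditional-frame features
    are precomputed once and the same cache is reused by every denoising
    step, so total cost grows linearly with the autoregression steps."""
    cost = 0
    cache = []  # precomputed features of all conditional frames
    for _ in range(num_ar_steps):
        for _ in range(denoise_steps):
            # `cache` is read directly; only the new clip is computed
            cost += frames_per_clip
        for _ in range(frames_per_clip):
            # after the clip is finalized, encode it once into the cache
            cache.append(encode(0.0, cache))
            cost += 1
    return cost

# Example: 4-frame clips, 8 autoregression steps, 10 denoising steps
print(naive_autoregressive(4, 8, 10))   # 1440 frame-encodings (quadratic)
print(cached_autoregressive(4, 8, 10))  # 352 frame-encodings (linear)
```

Doubling `num_ar_steps` roughly quadruples the naive cost but only doubles the cached cost, which is the redundancy Ca2-VDM is designed to eliminate.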