🤖 AI Summary
This work addresses the dual challenges of error accumulation and memory bottlenecks in autoregressive video diffusion models (ARVDMs) for long-video generation. Methodologically, we propose Meta-ARVDM, a unified meta-framework featuring spatiotemporal frame-compression encoding and multi-frame memory enhancement, enabling a Pareto-optimal trade-off between error propagation and memory overhead. Theoretically, we establish the first KL-divergence-based error analysis framework for ARVDMs, formally characterizing error-propagation dynamics and proving, via information-theoretic arguments, that the memory bottleneck is fundamentally unavoidable. Empirically, Meta-ARVDM achieves state-of-the-art long-video generation quality on the DMLab and Minecraft benchmarks while improving inference efficiency and reducing GPU memory consumption, placing it on the frontier of the error–memory Pareto trade-off.
📝 Abstract
A variety of Auto-Regressive Video Diffusion Models (ARVDMs) have achieved remarkable success in generating realistic long-form videos. However, theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDMs -- error accumulation and memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures that explicitly use more past frames. By compressing the frames, we also achieve a significantly improved trade-off between mitigating the memory bottleneck and maintaining inference efficiency. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto frontier between error accumulation and memory bottleneck across different methods.