Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the dual challenges of error accumulation and memory bottlenecks in autoregressive video diffusion models (ARVDMs) for long-video generation. Methodologically, it proposes Meta-ARVDM, a unified framework that subsumes most existing ARVDMs, and mitigates the memory bottleneck by designing network structures that explicitly condition on more past frames, using frame compression to improve the trade-off between memory-bottleneck mitigation and inference efficiency. Theoretically, it establishes a KL-divergence-based error analysis for ARVDMs that characterizes error propagation and proves, via an information-theoretic impossibility result, that the memory bottleneck cannot be fully avoided. Empirically, Meta-ARVDM improves long-video generation quality on DMLab and Minecraft benchmarks, and the experiments trace a Pareto frontier between error accumulation and memory bottleneck across methods.

📝 Abstract
A variety of Auto-Regressive Video Diffusion Models (ARVDM) have achieved remarkable successes in generating realistic long-form videos. However, theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDM -- error accumulation and memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures to explicitly use more past frames. We also achieve a significantly improved trade-off between the mitigation of the memory bottleneck and the inference efficiency by compressing the frames. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto-frontier between the error accumulation and memory bottleneck across different methods.
Problem

Research questions and friction points this paper is trying to address.

Develops a unified framework for Auto-Regressive Video Diffusion Models (ARVDM).
Analyzes error accumulation and memory bottleneck in ARVDM.
Proposes methods to mitigate memory bottleneck and improve inference efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed Meta-ARVDM, a unified ARVDM framework.
Designed network structures to use more past frames.
Compressed frames to balance memory and efficiency.
Jing Wang
Sea AI Lab, Nanyang Technological University, A*STAR
Fengzhuo Zhang
NUS
Xiaoli Li
Nanyang Technological University
Vincent Y. F. Tan
Professor, Department of Mathematics, National University of Singapore
Information Theory, Machine Learning, Signal Processing
Tianyu Pang
Sea AI Lab
Chao Du
Sea AI Lab
Aixin Sun
A*STAR
Zhuoran Yang
Yale University
machine learning, optimization, reinforcement learning, statistics