Progressive Autoregressive Video Diffusion Models

📅 2024-10-10

🏛️ arXiv.org

📈 Citations: 10

✨ Influential: 0

🤖 AI Summary

Current video diffusion models are computationally constrained, which limits generation to roughly 10-second clips. Autoregressive extension methods that naively concatenate frames suffer from abrupt scene transitions, unnatural motion, and error accumulation. To address this, the paper proposes a progressive per-frame noise schedule coupled with a fine-grained denoise-and-shift strategy, relaxing the conventional assumption that all frames share a single noise level. Within a sliding attention window, the approach yields smoother attention correspondence between frames and better propagation of information from earlier to later frames. The authors present what they describe as the first results on text-conditioned generation of 60-second (1,440-frame) videos, with much improved fidelity over the baselines, minimal quality degradation over time, and quality close to that of frontier short-video diffusion models.

📝 Abstract

Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. Existing methods naively achieve autoregressive long video generation by directly placing the ending of the previous clip at the front of the attention window as conditioning, which leads to abrupt scene changes, unnatural motion, and error accumulation. In this work, we introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models. Our key idea is to (1) assign the frames per-frame, progressively increasing noise levels rather than a single noise level and (2) denoise and shift the frames in small intervals rather than all at once. This allows for smoother attention correspondence among frames with adjacent noise levels, larger overlaps between the attention windows, and better propagation of information from the earlier to the later frames. Video diffusion models equipped with our progressive noise schedule can autoregressively generate long videos with much improved fidelity compared to the baselines and minimal quality degradation over time. We present the first results on text-conditioned 60-second (1440-frame) video generation at a quality close to frontier models. Code and video results are available at https://desaixie.github.io/pa-vdm/.
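The two ideas in the abstract (per-frame increasing noise levels, and denoising and shifting in small intervals) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `denoise_one_level` is a hypothetical stand-in for a video diffusion model call, the window holds as many frames as there are noise levels, the shift interval is a single frame, and the window is assumed to start already in its progressive state.

```python
import numpy as np


def denoise_one_level(frames, levels):
    """Hypothetical stand-in for one model call: lowers every frame's noise level by 1."""
    # A real model would predict and remove noise per frame; here we only shrink values.
    return frames * 0.9, levels - 1


def generate(total_frames, window=8, frame_shape=(4,), seed=0):
    """Toy progressive denoise-and-shift loop (illustrative assumption, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # Progressive schedule: frame i in the window sits at noise level i + 1,
    # so the front frame is nearly clean and the back frame is pure noise.
    levels = np.arange(1, window + 1)
    frames = rng.standard_normal((window, *frame_shape))
    finished = []
    while len(finished) < total_frames:
        frames, levels = denoise_one_level(frames, levels)
        # The front frame has reached level 0, so it is clean and leaves the window.
        finished.append(frames[0])
        # Shift by one frame and append a fresh pure-noise frame at the back.
        new_frame = rng.standard_normal((1, *frame_shape))
        frames = np.concatenate([frames[1:], new_frame])
        levels = np.concatenate([levels[1:], [window]])
    return np.stack(finished), levels


video, final_levels = generate(total_frames=20)
print(video.shape, final_levels)
```

After every step the window returns to the same progressive noise pattern, which is why the loop can run for as many frames as needed in this sketch.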

Problem

Research questions and friction points this paper is trying to address.

Extend video diffusion models to generate longer videos

Reduce abrupt scene changes and unnatural motion in autoregressive generation

Improve long video fidelity with progressive noise scheduling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive noise levels per frame

Denoise and shift frames in small intervals

Smoother attention correspondence among frames with adjacent noise levels
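The claim that shifting in small intervals gives larger overlaps between attention windows can be checked with simple arithmetic. The sketch below is illustrative only: the helper `window_overlap` and the window and shift sizes are hypothetical values chosen for the example, not figures from the paper.

```python
def window_overlap(window: int, shift: int) -> float:
    """Fraction of frames shared by two consecutive attention windows."""
    if not 0 < shift <= window:
        raise ValueError("shift must be between 1 and the window size")
    return (window - shift) / window


# Hypothetical numbers: a 16-frame window.
# Chunked extension that reuses only the last 4 frames as conditioning moves 12 frames per step.
chunked = window_overlap(window=16, shift=12)
# Shifting a single frame at a time keeps 15 of the 16 frames in the next window.
progressive = window_overlap(window=16, shift=1)
print(chunked, progressive)  # 0.25 0.9375
```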

🔎 Similar Papers

Pyramidal Flow Matching for Efficient Video Generative Modeling

2024-10-08 · arXiv.org · Citations: 31
