Pyramidal Flow Matching for Efficient Video Generative Modeling

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 31
✨ Influential: 7
📄 PDF
🤖 AI Summary
Video generation faces significant challenges: spatiotemporal modeling complexity, high computational cost, and substantial data requirements. Existing cascaded approaches alleviate the per-stage burden, but their independently optimized submodules hinder knowledge sharing and rule out end-to-end co-optimization. To address this, the paper proposes a unified **pyramidal flow matching** framework: it reinterprets the denoising trajectory as a series of pyramid stages, only the final of which operates on the full-resolution latent; interlinks the flows of adjacent stages so the trajectory remains continuous across scales; and adds a temporal pyramid that compresses the full-resolution history for autoregressive generation, all within a single Diffusion Transformer (DiT) backbone trained end-to-end. Trained for 20.7k A100 GPU-hours, the model generates high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS. All code and models are open-sourced.

๐Ÿ“ Abstract
Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution latent. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io.
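The abstract's core idea, a denoising trajectory split into pyramid stages where only the last runs at full resolution, can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the ground-truth straight-line velocity stands in for the learned DiT, and the nearest-neighbor upsampler, stage count, and renoising coefficients (0.9/0.1) at stage boundaries are illustrative assumptions.

```python
import numpy as np

def upsample(x, factor=2):
    # Nearest-neighbor upsampling of a square latent (toy stand-in
    # for a learned latent upsampler).
    return np.kron(x, np.ones((factor, factor)))

def pyramid_targets(x1, num_stages=3):
    # Targets at each pyramid scale, built by 2x average-pooling the
    # full-resolution latent x1; returned coarsest first.
    targets = [x1]
    for _ in range(num_stages - 1):
        x = targets[0]
        h, w = x.shape
        targets.insert(0, x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return targets

def sample_pyramid_flow(x1, steps_per_stage=8, seed=None):
    # Toy sampler: start from noise at the coarsest scale, integrate a
    # straight-line flow toward that scale's target, then upsample (with
    # some re-injected noise) and continue at the next scale. Most steps
    # therefore run on latents far smaller than the final resolution.
    rng = np.random.default_rng(seed)
    targets = pyramid_targets(x1)
    x = rng.standard_normal(targets[0].shape)
    for k, tgt in enumerate(targets):
        for i in range(steps_per_stage):
            t = i / steps_per_stage
            v = (tgt - x) / (1 - t)        # exact velocity of a linear flow
            x = x + v / steps_per_stage    # Euler step
        if k < len(targets) - 1:
            # Stage jump: upsample and renoise so the next flow segment
            # starts from a matching intermediate noise level.
            up = upsample(x)
            x = 0.9 * up + 0.1 * rng.standard_normal(up.shape)
    return x
```

Because the toy velocity field is exact, the final stage lands on the full-resolution target; in the actual method, a single DiT predicts the velocity at every stage instead.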
Problem

Research questions and friction points this paper is trying to address.

Reduces computational complexity in video generation
Enables efficient spatiotemporal modeling with pyramidal flow
Supports high-quality video generation at 768p resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified pyramidal flow matching algorithm
End-to-end optimization with Diffusion Transformer
Autoregressive video generation with temporal pyramid
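The temporal-pyramid idea behind the autoregressive design, conditioning on a history where only recent frames stay at full resolution while older frames are progressively downsampled, can be sketched as follows. The window size, pooling schedule, and `compress_history` helper are all illustrative assumptions, not the paper's actual conditioning scheme.

```python
import numpy as np

def compress_history(frames, window=2, max_level=2):
    # Toy temporal-pyramid conditioning: the `window` most recent frames
    # keep full resolution; each older group gets one extra 2x average
    # pooling (capped at max_level), shrinking the history tokens the
    # DiT must attend to. Returns frames oldest first, most compressed.
    out = []
    for age, f in enumerate(reversed(frames)):
        level = min(age // window, max_level)
        for _ in range(level):
            h, w = f.shape
            f = f.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        out.append(f)
    return out[::-1]
```

For six 8x8 latent frames this keeps the two newest at 8x8 but shrinks the two oldest to 2x2, cutting the conditioning tokens from 384 to 168 in this toy setup.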