Pyramidal Flow Matching for Efficient Video Generative Modeling

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 31
✨ Influential: 7
📄 PDF
🤖 AI Summary
Video generation faces significant challenges: spatiotemporal modeling complexity, high computational cost, and substantial data requirements. Existing cascaded approaches alleviate the per-stage burden, but their independently optimized submodules hinder knowledge sharing and rule out end-to-end co-optimization. To address this, the paper proposes a unified **pyramidal flow matching** framework: it reinterprets the denoising trajectory as a series of pyramid stages, only the final of which operates on the full-resolution latent; interlinks the flows of adjacent stages so the trajectory remains continuous across scales; and adds a temporal pyramid that compresses the full-resolution history for autoregressive generation, all within a single Diffusion Transformer (DiT) backbone trained end-to-end. Trained for 20.7k A100 GPU-hours, the model generates high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS. All code and models are open-sourced.

๐Ÿ“ Abstract
Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution latent. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io.
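The abstract's core idea, a denoising trajectory split into pyramid stages where only the last runs at full resolution, can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the ground-truth straight-line velocity stands in for the learned DiT, and the nearest-neighbor upsampler, stage count, and renoising coefficients (0.9/0.1) at stage boundaries are illustrative assumptions.

```python
import numpy as np

def upsample(x, factor=2):
    # Nearest-neighbor upsampling of a square latent (toy stand-in
    # for a learned latent upsampler).
    return np.kron(x, np.ones((factor, factor)))

def pyramid_targets(x1, num_stages=3):
    # Targets at each pyramid scale, built by 2x average-pooling the
    # full-resolution latent x1; returned coarsest first.
    targets = [x1]
    for _ in range(num_stages - 1):
        x = targets[0]
        h, w = x.shape
        targets.insert(0, x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return targets

def sample_pyramid_flow(x1, steps_per_stage=8, seed=None):
    # Toy sampler: start from noise at the coarsest scale, integrate a
    # straight-line flow toward that scale's target, then upsample (with
    # some re-injected noise) and continue at the next scale. Most steps
    # therefore run on latents far smaller than the final resolution.
    rng = np.random.default_rng(seed)
    targets = pyramid_targets(x1)
    x = rng.standard_normal(targets[0].shape)
    for k, tgt in enumerate(targets):
        for i in range(steps_per_stage):
            t = i / steps_per_stage
            v = (tgt - x) / (1 - t)        # exact velocity of a linear flow
            x = x + v / steps_per_stage    # Euler step
        if k < len(targets) - 1:
            # Stage jump: upsample and renoise so the next flow segment
            # starts from a matching intermediate noise level.
            up = upsample(x)
            x = 0.9 * up + 0.1 * rng.standard_normal(up.shape)
    return x
```

Because the toy velocity field is exact, the final stage lands on the full-resolution target; in the actual method, a single DiT predicts the velocity at every stage instead.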
Problem

Research questions and friction points this paper is trying to address.

Reduces computational complexity in video generation
Enables efficient spatiotemporal modeling with pyramidal flow
Supports high-quality video generation at 768p resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified pyramidal flow matching algorithm
End-to-end optimization with Diffusion Transformer
Autoregressive video generation with temporal pyramid
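The temporal-pyramid idea behind the autoregressive design, conditioning on a history where only recent frames stay at full resolution while older frames are progressively downsampled, can be sketched as follows. The window size, pooling schedule, and `compress_history` helper are all illustrative assumptions, not the paper's actual conditioning scheme.

```python
import numpy as np

def compress_history(frames, window=2, max_level=2):
    # Toy temporal-pyramid conditioning: the `window` most recent frames
    # keep full resolution; each older group gets one extra 2x average
    # pooling (capped at max_level), shrinking the history tokens the
    # DiT must attend to. Returns frames oldest first, most compressed.
    out = []
    for age, f in enumerate(reversed(frames)):
        level = min(age // window, max_level)
        for _ in range(level):
            h, w = f.shape
            f = f.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        out.append(f)
    return out[::-1]
```

For six 8x8 latent frames this keeps the two newest at 8x8 but shrinks the two oldest to 2x2, cutting the conditioning tokens from 384 to 168 in this toy setup.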