🤖 AI Summary
Existing diffusion models generate only second-scale videos; minute-scale long-video synthesis remains an open challenge. This paper proposes MALT Diffusion, a segment-level autoregressive diffusion framework that decomposes long videos into short segments and models long-range inter-segment temporal dependencies via a memory-augmented latent-space Transformer. It introduces two key innovations: (i) a recurrent attention mechanism coupled with compact memory latent vectors to enable persistent temporal state propagation, and (ii) latent-space temporal conditioning and long-context stability training to mitigate quality degradation over long horizons. Evaluated on UCF-101, the method generates 128-frame videos with an FVD of 220.4, surpassing the prior state of the art of 648.4 by 428 points. Moreover, it supports text-driven generation of arbitrarily long videos, achieving significant improvements in video duration, temporal coherence, and visual fidelity over existing approaches.
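To make the segment-level autoregressive scheme concrete, the sketch below shows one plausible generation loop in PyTorch. The callables `denoise_segment` and `update_memory`, and the shapes involved, are hypothetical stand-ins; the summary does not specify the model's components in code.

```python
# A minimal sketch of segment-level autoregressive generation with a
# persistent memory latent, assuming hypothetical components:
#   denoise_segment(noise, memory) -> clean segment latents
#   update_memory(memory, segment) -> updated compact memory latent
import torch


@torch.no_grad()
def generate_long_video(denoise_segment, update_memory, init_memory,
                        num_segments: int, segment_shape: tuple):
    """Generate a long video one short segment at a time."""
    memory = init_memory
    segments = []
    for _ in range(num_segments):
        noise = torch.randn(segment_shape)
        # Condition each segment's denoising on the long-range context
        # summarized in the memory latent.
        segment = denoise_segment(noise, memory)
        # Fold the new segment back into the memory so temporal state
        # persists across segment boundaries.
        memory = update_memory(memory, segment)
        segments.append(segment)
    # Concatenate segments along the time axis (assumed to be dim 1).
    return torch.cat(segments, dim=1)
```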
📝 Abstract
Diffusion models are successful at synthesizing high-quality videos but are limited to generating short clips (e.g., 2-10 seconds). Synthesizing sustained footage (e.g., minutes or longer) remains an open research question. In this paper, we propose MALT Diffusion (using Memory-Augmented Latent Transformers), a new diffusion model specialized for long video generation. MALT Diffusion (or just MALT) handles long videos by subdividing them into short segments and performing segment-level autoregressive generation. To achieve this, we first propose recurrent attention layers that encode multiple segments into a compact memory latent vector; by maintaining this memory vector over time, MALT can condition on it and continuously generate new footage based on a long temporal context. We also present several training techniques that enable the model to generate frames over a long horizon with consistent quality and minimal degradation. We validate the effectiveness of MALT on popular long video benchmarks, with extensive analysis of its long-context understanding capability and generation stability. For example, MALT achieves an FVD score of 220.4 on 128-frame video generation on UCF-101, outperforming the previous state of the art of 648.4. Finally, we explore MALT's capabilities in a text-to-video setting and show that it can generate longer videos with better temporal coherence than recent techniques for long text-to-video generation.
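The abstract does not give the recurrent attention layers in code; below is a minimal PyTorch sketch of one plausible reading, in which a fixed-size set of memory latent vectors cross-attends over each segment's latent tokens and is updated residually. All names, dimensions, and design choices here (`RecurrentMemoryAttention`, the learned initial memory, the residual MLP) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class RecurrentMemoryAttention(nn.Module):
    """Updates a compact memory latent by attending over one video segment."""

    def __init__(self, dim: int = 512, num_memory: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned initial memory state, broadcast over the batch.
        self.init_memory = nn.Parameter(torch.randn(1, num_memory, dim) * 0.02)
        # Memory vectors act as queries; segment tokens are keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mem = nn.LayerNorm(dim)
        self.norm_seg = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def initial_state(self, batch_size: int) -> torch.Tensor:
        return self.init_memory.expand(batch_size, -1, -1)

    def forward(self, memory: torch.Tensor, segment_tokens: torch.Tensor) -> torch.Tensor:
        # memory: (B, M, D) compact state; segment_tokens: (B, T, D) latent
        # tokens of the newest segment.
        update, _ = self.attn(
            self.norm_mem(memory),
            self.norm_seg(segment_tokens),
            self.norm_seg(segment_tokens),
        )
        memory = memory + update  # residual update preserves older context
        memory = memory + self.mlp(self.norm_mlp(memory))
        return memory
```

A segment-level loop would call `initial_state` once, then pass each newly generated segment's tokens through `forward`, carrying the returned memory forward as conditioning for the next segment's denoising.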