AI Summary
To address the limited temporal coherence and naturalness of long-horizon video generation under scarce training data, this paper proposes MAGI, a unified framework that integrates intra-frame masked modeling with inter-frame causal modeling, supported by a hierarchical video tokenization and reconstruction architecture. It introduces Complete Teacher Forcing (CTF), a novel training mechanism that conditions autoregressive generation on fully observed ground-truth frames instead of masked frames, enabling a smooth transition from patch-level to frame-level modeling and effectively mitigating exposure bias. Evaluated under extreme data constraints (only 16 training frames), MAGI achieves high-fidelity, temporally consistent generation of videos exceeding 100 frames. On first-frame conditioned video prediction, it improves Fréchet Video Distance (FVD) by 23% over prior methods, substantially advancing autoregressive video generation and setting a new state of the art.
Abstract
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (as in Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues such as exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
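The CTF idea described above can be illustrated with a minimal sketch of how training inputs might be assembled. This is an illustrative reconstruction, not the paper's implementation: the function name `build_ctf_inputs`, the zero mask token, and the tensor shapes are all assumptions made for the example. The key contrast with MTF is in how the context is built: CTF conditions the masked target frame on the *complete* ground-truth past frames, whereas MTF would condition on masked past frames.

```python
import numpy as np

def build_ctf_inputs(frames, mask_ratio=0.5, rng=None):
    """Illustrative sketch of Complete Teacher Forcing (CTF) input construction.

    frames: array of shape (T, N, D) -- T frames, each with N patch tokens
            of dimension D.
    For each target frame t, a fraction of its patches is replaced by a
    mask token (intra-frame masked modeling), while the conditioning
    context frames 0..t-1 are the complete, unmasked ground-truth frames.
    Under MTF, the context frames would be masked as well.
    """
    rng = rng or np.random.default_rng()
    T, N, D = frames.shape
    mask_token = np.zeros(D)  # a learned embedding in a real model (assumption)
    inputs, targets = [], []
    for t in range(T):
        keep = rng.random(N) > mask_ratio          # patches left visible in frame t
        masked_t = np.where(keep[:, None], frames[t], mask_token)
        context = frames[:t]                       # CTF: fully observed past frames
        inputs.append((context, masked_t))
        targets.append(frames[t])
    return inputs, targets

# Toy usage: 4 frames, 8 patches per frame, 16-dim tokens.
frames = np.random.default_rng(0).standard_normal((4, 8, 16))
inputs, targets = build_ctf_inputs(frames, mask_ratio=0.5,
                                   rng=np.random.default_rng(1))
context, masked = inputs[2]
print(context.shape)  # (2, 8, 16): frames 0 and 1, unmasked
```

At inference time this matters because the model generates each frame conditioned on its own previously generated (complete) frames, so CTF's train-time conditioning matches the test-time regime, which is how it mitigates exposure bias.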