Taming Teacher Forcing for Masked Autoregressive Video Generation

2025-01-21
Citations: 0
Influential: 0
AI Summary
To address the limited temporal coherence and naturalness of long-horizon video generation when training data is scarce, this paper proposes MAGI, a unified framework that integrates intra-frame masked modeling with inter-frame causal modeling, supported by a hierarchical video tokenization and reconstruction architecture. It introduces Complete Teacher Forcing (CTF), a training mechanism that conditions masked frames on fully observed ground-truth frames rather than on masked ones, enabling a smooth transition from patch-level to frame-level autoregressive generation. Targeted training strategies are used to mitigate exposure bias. Trained on sequences as short as 16 frames, MAGI generates coherent videos exceeding 100 frames. In first-frame conditioned video prediction, CTF improves Fréchet Video Distance (FVD) by 23% over Masked Teacher Forcing (MTF), setting a new benchmark in autoregressive video generation.

Abstract
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
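The difference between CTF and MTF described above comes down to what earlier frames look like when a masked frame is being predicted. The following is a minimal illustrative sketch of that idea, not the authors' implementation. It assumes each frame is a flat list of integer token ids. `MASK_ID`, `mask_frames`, `conditioning_frames`, and `frame_causal_mask` are hypothetical names introduced here for illustration.

```python
import random

MASK_ID = -1  # hypothetical placeholder id for a masked token (assumption)


def mask_frames(frames, mask_ratio, seed=0):
    """Randomly mask tokens inside every frame (intra-frame masked modeling).

    frames: list of frames, each a list of non-negative token ids.
    Returns the masked frames and per-token boolean flags (True = masked).
    """
    rng = random.Random(seed)
    masked, is_masked = [], []
    for frame in frames:
        flags = [rng.random() < mask_ratio for _ in frame]
        masked.append([MASK_ID if f else tok for tok, f in zip(frame, flags)])
        is_masked.append(flags)
    return masked, is_masked


def conditioning_frames(frames, masked, mode):
    """Choose the version of earlier frames that a masked frame is conditioned on.

    "ctf": complete (unmasked) ground-truth frames.
    "mtf": the masked frames themselves.
    """
    if mode == "ctf":
        return frames
    if mode == "mtf":
        return masked
    raise ValueError(f"unknown mode: {mode}")


def frame_causal_mask(num_frames):
    """Entry [t][s] is True when frame t may see conditioning frame s (s < t)."""
    return [[s < t for s in range(num_frames)] for t in range(num_frames)]


frames = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
masked, is_masked = mask_frames(frames, mask_ratio=0.5)
ctf_context = conditioning_frames(frames, masked, "ctf")
mtf_context = conditioning_frames(frames, masked, "mtf")
allowed = frame_causal_mask(len(frames))
```

In this sketch, the prediction targets are the same in both modes (the tokens flagged in `is_masked`); only the context changes. Under CTF the context matches what is available at inference time, where previously generated frames are complete, which is one way to read the "smooth transition" to frame-level autoregressive generation.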
Problem

Research questions and friction points this paper is trying to address.

Video Generation
Coherence
Limited Training Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

MAGI
Complete Teacher Forcing
High-quality Long Video Generation