AI Summary
To address the limited temporal coherence and naturalness of long-horizon video generation under scarce training data, this paper proposes MAGI, a unified framework that integrates intra-frame masked modeling with inter-frame causal modeling, supported by a hierarchical video tokenization and reconstruction architecture. It introduces Complete Teacher Forcing (CTF), a novel training mechanism that conditions autoregressive generation on fully observed ground-truth frames instead of masked frames, enabling a smooth transition from patch-level to frame-level modeling and effectively mitigating exposure bias. Evaluated under extreme data constraints (only 16 training frames), MAGI achieves high-fidelity, temporally consistent generation of videos exceeding 100 frames. On first-frame conditioned video prediction, it improves Fréchet Video Distance (FVD) by 23% over prior methods, substantially advancing autoregressive video generation and setting a new state of the art.
Abstract
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (as in Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues such as exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
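The CTF idea described above can be illustrated with a minimal sketch of how training inputs might be assembled. This is an illustrative reconstruction, not the paper's implementation: the function name `build_ctf_inputs`, the zero mask token, and the tensor shapes are all assumptions made for the example. The key contrast with MTF is in how the context is built: CTF conditions the masked target frame on the *complete* ground-truth past frames, whereas MTF would condition on masked past frames.

```python
import numpy as np

def build_ctf_inputs(frames, mask_ratio=0.5, rng=None):
    """Illustrative sketch of Complete Teacher Forcing (CTF) input construction.

    frames: array of shape (T, N, D) -- T frames, each with N patch tokens
            of dimension D.
    For each target frame t, a fraction of its patches is replaced by a
    mask token (intra-frame masked modeling), while the conditioning
    context frames 0..t-1 are the complete, unmasked ground-truth frames.
    Under MTF, the context frames would be masked as well.
    """
    rng = rng or np.random.default_rng()
    T, N, D = frames.shape
    mask_token = np.zeros(D)  # a learned embedding in a real model (assumption)
    inputs, targets = [], []
    for t in range(T):
        keep = rng.random(N) > mask_ratio          # patches left visible in frame t
        masked_t = np.where(keep[:, None], frames[t], mask_token)
        context = frames[:t]                       # CTF: fully observed past frames
        inputs.append((context, masked_t))
        targets.append(frames[t])
    return inputs, targets

# Toy usage: 4 frames, 8 patches per frame, 16-dim tokens.
frames = np.random.default_rng(0).standard_normal((4, 8, 16))
inputs, targets = build_ctf_inputs(frames, mask_ratio=0.5,
                                   rng=np.random.default_rng(1))
context, masked = inputs[2]
print(context.shape)  # (2, 8, 16): frames 0 and 1, unmasked
```

At inference time this matters because the model generates each frame conditioned on its own previously generated (complete) frames, so CTF's train-time conditioning matches the test-time regime, which is how it mitigates exposure bias.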