🤖 AI Summary
Long-video generation faces two challenges: poor historical scene consistency and excessive memory consumption. Windowed attention causes catastrophic forgetting, while full-history modeling runs into GPU memory bottlenecks. To address this, we propose Memorize-and-Generate (MAG), the first framework to decouple memory compression from frame generation, featuring a lightweight KV cache compression mechanism and a co-training architecture. We introduce MAG-Bench, the first benchmark explicitly designed to evaluate historical memory retention. Additionally, MAG incorporates frame-level autoregressive modeling, enhanced windowed attention, and a novel historical consistency constraint loss. Experiments demonstrate that MAG maintains state-of-the-art (SOTA) performance on standard metrics while improving historical memory retention by 37% and reducing inference latency by 62% compared to full-history attention.
📝 Abstract
Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
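The trade-off described in the abstract can be made concrete with a toy cache-size simulation. The following is a minimal sketch, not the paper's method: `compress_frame` is a hypothetical stand-in that mean-pools a frame's tokens into a few slots, whereas MAG trains a memory model for this step. The names `simulate`, `window`, and `slots`, and the choice to compress every past frame, are assumptions made for illustration only.

```python
from typing import Dict, List

Vector = List[float]


def compress_frame(tokens: List[Vector], slots: int) -> List[Vector]:
    """Toy stand-in for a learned memory model (an assumption, not MAG itself):
    mean-pool a frame's tokens into at most `slots` summary vectors."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    if not tokens:
        return []
    slots = min(slots, len(tokens))
    size = -(-len(tokens) // slots)  # ceiling division: tokens per slot
    pooled = []
    for start in range(0, len(tokens), size):
        group = tokens[start:start + size]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def simulate(num_frames: int, tokens_per_frame: int, window: int, slots: int) -> Dict[str, int]:
    """Count cache entries after `num_frames` steps under three strategies."""
    if window <= 0:
        raise ValueError("window must be positive")
    full: List[Vector] = []      # keeps every token of every frame
    windowed: List[Vector] = []  # keeps only the last `window` frames
    memory: List[Vector] = []    # keeps `slots` summaries per past frame
    for i in range(num_frames):
        # Dummy 2-D tokens: (frame index, token index).
        frame = [[float(i), float(j)] for j in range(tokens_per_frame)]
        full.extend(frame)
        windowed.extend(frame)
        windowed = windowed[-window * tokens_per_frame:]  # drop frames outside the window
        memory.extend(compress_frame(frame, slots))
    return {"full": len(full), "windowed": len(windowed), "compressed": len(memory)}


if __name__ == "__main__":
    print(simulate(num_frames=100, tokens_per_frame=16, window=4, slots=2))
    # {'full': 1600, 'windowed': 64, 'compressed': 200}
```

In this toy setting the windowed cache stays small but retains nothing from the first 96 frames, the full cache retains everything at 1600 entries, and the compressed cache keeps a coarse summary of every frame in 200 entries. How well a learned compression preserves scene content is the quantity a benchmark such as MAG-Bench is meant to measure; this sketch only illustrates the memory accounting.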