Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

📅 2025-12-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Long-video generation faces dual challenges: poor historical scene consistency and excessive memory consumption—windowed attention causes catastrophic forgetting, while full-history modeling incurs GPU memory bottlenecks. To address this, we propose Memorize-and-Generate (MAG), the first framework to decouple memory compression from frame generation, featuring a lightweight KV cache compression mechanism and a co-training architecture. We introduce MAG-Bench, the first benchmark explicitly designed to evaluate historical memory retention. Additionally, MAG incorporates frame-level autoregressive modeling, enhanced windowed attention, and a novel historical consistency constraint loss. Experiments demonstrate that MAG maintains state-of-the-art (SOTA) performance on standard metrics while improving historical memory retention by 37% and reducing inference latency by 62% compared to full-history attention.

Technology Category

Application Category

📝 Abstract
Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose extbf{Memorize-and-Generate (MAG)}, a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce extbf{MAG-Bench} to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Addresses catastrophic forgetting in long video generation
Reduces memory costs while preserving historical context
Enhances scene consistency through decoupled memory and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples memory compression and frame generation tasks
Trains separate models for memory and frame synthesis
Uses compact KV cache to retain historical context
🔎 Similar Papers
No similar papers found.
T
Tianrui Zhu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Shiyi Zhang
Shiyi Zhang
Tsinghua University
Video GenerationVideo Understanding
Zhirui Sun
Zhirui Sun
Southern University of Science and Technology
Robot PerceptionPath Planning
J
Jingqi Tian
Tsinghua Shenzhen International Graduate School, Tsinghua University
Y
Yansong Tang
Tsinghua Shenzhen International Graduate School, Tsinghua University