Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

📅 2025-12-21

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

Long-video generation faces dual challenges: poor historical scene consistency and excessive memory consumption—windowed attention causes catastrophic forgetting, while full-history modeling incurs GPU memory bottlenecks. To address this, we propose Memorize-and-Generate (MAG), the first framework to decouple memory compression from frame generation, featuring a lightweight KV cache compression mechanism and a co-training architecture. We introduce MAG-Bench, the first benchmark explicitly designed to evaluate historical memory retention. Additionally, MAG incorporates frame-level autoregressive modeling, enhanced windowed attention, and a novel historical consistency constraint loss. Experiments demonstrate that MAG maintains state-of-the-art (SOTA) performance on standard metrics while improving historical memory retention by 37% and reducing inference latency by 62% compared to full-history attention.

Technology Category

Application Category

📝 Abstract

Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose extbf{Memorize-and-Generate (MAG)}, a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce extbf{MAG-Bench} to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.

Problem

Research questions and friction points this paper is trying to address.

Addresses catastrophic forgetting in long video generation

Reduces memory costs while preserving historical context

Enhances scene consistency through decoupled memory and generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples memory compression and frame generation tasks

Trains separate models for memory and frame synthesis

Uses compact KV cache to retain historical context

🔎 Similar Papers

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance