Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval

📅 2025-06-03

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

Existing interactive long-video generation methods suffer from insufficient historical context modeling, leading to poor scene consistency. To address this, we propose a novel “context-as-memory” paradigm: historical frames are directly treated as retrievable memory, enabling conditional modeling via frame-wise concatenation—eliminating the need for auxiliary control modules. We further design a lightweight memory retrieval mechanism based on camera field-of-view (FOV) overlap, ensuring information integrity while substantially reducing computational overhead. Our approach is embodied in an end-to-end trainable video diffusion architecture. On interactive long-video generation benchmarks, our method significantly outperforms state-of-the-art approaches, demonstrating strong generalization to unseen open-domain scenes and reducing redundant computation by over 40%.

Technology Category

Application Category

📝 Abstract

Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs, even generalizing effectively to open-domain scenarios not seen during training. The link of our project page is https://context-as-memory.github.io/.

Problem

Research questions and friction points this paper is trying to address.

Enhancing scene-consistent memory in long video generation

Reducing computational overhead with relevant context retrieval

Improving interactive video generation without external control modules

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses historical context as memory for video generation

Stores context in frame format without post-processing

Retrieves relevant frames via FOV overlap analysis

🔎 Similar Papers

Video In-context Learning