MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Streaming long-video generation faces dual challenges: difficulty maintaining long-range content consistency and excessive memory overhead. This paper proposes a dynamic prompt-driven memory mechanism that adaptively retrieves, from the KV cache, the historical frames most semantically relevant to the current text prompt; ensures temporal consistency via cross-frame semantic alignment; sparsely activates key memory tokens within attention layers; and reweights retrieved historical frames under text guidance. Unlike prior fixed-compression approaches, the method is dynamic in both memory retrieval and memory access, while preserving full compatibility with arbitrary streaming video generation models. Experiments demonstrate substantial improvements in narrative coherence and long-term consistency across evolving events, with only a 7.9% increase in inference latency. The approach significantly outperforms baselines on quantitative consistency metrics.
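The retrieval step described above could be sketched roughly as follows. This is a simplified, illustrative reading, not the authors' implementation: the embedding inputs, the cosine-similarity scoring, and the `top_k` cutoff are all assumptions.

```python
import numpy as np

def retrieve_memory(prompt_emb, frame_embs, kv_cache, top_k=4):
    """Hypothetical sketch: pick the historical frames whose embeddings
    are most similar to the current chunk's text-prompt embedding, and
    return their cached key/value entries as the updated memory bank."""
    # Cosine similarity between the prompt and each cached frame embedding.
    p = prompt_emb / np.linalg.norm(prompt_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    scores = f @ p
    # Keep the top-k most relevant frames, restored to temporal order.
    idx = np.sort(np.argsort(scores)[-top_k:])
    return [kv_cache[i] for i in idx], scores[idx]
```

Because retrieval runs once per chunk (before generation), its cost is amortized over all frames of that chunk, which is consistent with the small reported latency overhead.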

📝 Abstract
The core challenge for streaming video generation is maintaining content consistency over a long context, which places high demands on memory design. Most existing solutions maintain memory by compressing historical frames with predefined strategies. However, different upcoming video chunks should draw on different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the next chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to that chunk's text prompt. This design preserves narrative coherence even when new events occur or the scene switches in future frames. In addition, during generation we activate only the most relevant tokens in the memory bank for each query in the attention layers, which effectively guarantees generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computational overhead (a 7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model that uses a KV cache.
Problem

Research questions and friction points this paper is trying to address.

How to maintain content consistency in long video generation.
How to update memory dynamically with the historical frames most relevant to the upcoming chunk.
How to keep generation efficient when attending over a growing memory bank.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic memory update with text-prompted frame retrieval
Selective token activation in attention for efficiency
Compatible with streaming models using KV cache
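The second innovation, selective token activation, could be sketched as a top-k-masked attention over the memory bank. This is a minimal illustrative form, not the paper's actual attention kernel; the per-query `top_k` masking and all tensor shapes are assumptions.

```python
import numpy as np

def sparse_memory_attention(q, mem_k, mem_v, top_k=2):
    """Hypothetical sketch: each query attends only to its top-k most
    relevant memory tokens instead of the whole memory bank."""
    d = q.shape[-1]
    logits = q @ mem_k.T / np.sqrt(d)  # (n_queries, n_memory_tokens)
    # Mask out all but each query's top-k highest-scoring memory tokens.
    kth = np.sort(logits, axis=-1)[:, -top_k][:, None]
    masked = np.where(logits >= kth, logits, -np.inf)
    # Softmax over the surviving tokens only (masked ones get weight 0).
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ mem_v
```

Restricting each query to a fixed number of memory tokens keeps the attention cost bounded as the history grows, which matches the paper's efficiency claim.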