Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation models struggle to keep a scene's state evolving coherently through occlusions or interruptions; the hidden state often freezes instead. The authors propose ReMind, a framework that elicits dynamic memory in video diffusion Transformers through three components: a taxonomy of more than one hundred dynamic event classes, training data augmented with memory interruptions, and an event-aware, node-structured curriculum. The method converts each clip into a frame graph, adapts the model's existing KV cache through reference-cache training, and adds PM-RoPE, a camera-phase extension of RoPE for spatiotemporal retrieval. ReMind achieves the best overall scores on STEVO-Bench and on state recovery tasks, and general image-to-video evaluations show no catastrophic forgetting.

📝 Abstract

Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework that elicits dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. We organize a camera-annotated training mixture around a taxonomy of 100+ dynamic events, combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, a camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm that this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.
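
The frame graph and node-drop step described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `FrameNode`, `build_frame_graph`, `node_drop`, and `temporal_gaps`, the rule that every `anchor_every`-th frame is a protected anchor, and the uniform drop probability are all hypothetical. The abstract does not specify how anchors are chosen or how nodes are dropped.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FrameNode:
    time_index: int  # position of the frame in the original clip
    role: str        # "anchor" (protected), "degraded", or "normal"


def build_frame_graph(num_frames: int, anchor_every: int,
                      degraded: Tuple[int, int]) -> List[FrameNode]:
    """Label each frame; anchors take priority over the degraded interval.

    Assumption: anchors are placed at a fixed stride and the degraded
    interval is the half-open range [degraded[0], degraded[1]).
    """
    lo, hi = degraded
    nodes = []
    for t in range(num_frames):
        if t % anchor_every == 0:
            role = "anchor"
        elif lo <= t < hi:
            role = "degraded"
        else:
            role = "normal"
        nodes.append(FrameNode(t, role))
    return nodes


def node_drop(nodes: List[FrameNode], drop_prob: float,
              rng: random.Random) -> List[FrameNode]:
    """Drop non-anchor nodes at random; survivors keep their time_index."""
    return [n for n in nodes
            if n.role == "anchor" or rng.random() >= drop_prob]


def temporal_gaps(nodes: List[FrameNode]) -> List[int]:
    """Count the frames skipped between consecutive surviving nodes."""
    return [b.time_index - a.time_index - 1
            for a, b in zip(nodes, nodes[1:])]


graph = build_frame_graph(num_frames=12, anchor_every=4, degraded=(5, 7))
kept = node_drop(graph, drop_prob=0.5, rng=random.Random(0))
gaps = temporal_gaps(kept)
```

Because surviving nodes keep their original `time_index`, the gaps left by dropped frames remain explicit rather than being closed up by renumbering, which is one plausible reading of "explicit temporal gaps" in the abstract.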

Problem

Research questions and friction points this paper is trying to address.

video generation

dynamic memory

state evolution

out-of-sight

temporal continuity

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic memory

video world models

memory-oriented training