🤖 AI Summary
This work addresses a limitation of existing video generation models: they struggle to keep scene state evolving coherently during occlusions or interruptions, and the unobserved state often freezes. The authors propose ReMind, a framework that elicits dynamic memory behavior in video diffusion Transformers through a taxonomy of over one hundred dynamic event classes, training data augmented with memory interruptions, and an event-aware, node-structured curriculum. The method combines adaptation of the KV cache already present in pretrained models, PM-RoPE (a camera-phase extension of RoPE for spatiotemporal retrieval), frame-graph modeling of each clip, and reference-cache training. According to the authors, ReMind achieves the best overall scores on STEVO-Bench and on state recovery tasks, and general image-to-video evaluations show no catastrophic forgetting.
📝 Abstract
Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework that elicits dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. We build a camera-annotated training mixture, organized around a taxonomy of 100+ dynamic events, that combines VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, a camera-phase extension of RoPE, enables spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm that this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.
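The frame-graph and node-drop ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `FrameNode`, `build_frame_graph`, `node_drop`, and `temporal_gaps`, the role labels, and the fixed-interval anchor rule are all hypothetical, chosen only to show how protected anchors, a degraded interval, and explicit temporal gaps might be represented.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameNode:
    index: int  # original frame index, kept so temporal gaps stay explicit
    role: str   # hypothetical labels: "anchor", "degraded", or "clean"


def build_frame_graph(num_frames: int, anchor_every: int,
                      degraded_span: Tuple[int, int]) -> List[FrameNode]:
    """Label each frame; anchors take priority over the degraded interval.

    Assumption: anchors are placed at a fixed stride. The paper may choose
    them differently.
    """
    lo, hi = degraded_span  # half-open interval [lo, hi)
    nodes = []
    for i in range(num_frames):
        if i % anchor_every == 0:
            role = "anchor"
        elif lo <= i < hi:
            role = "degraded"
        else:
            role = "clean"
        nodes.append(FrameNode(i, role))
    return nodes


def node_drop(nodes: List[FrameNode], drop_prob: float,
              rng: random.Random) -> List[FrameNode]:
    """Randomly drop non-anchor nodes; anchors are protected and always kept."""
    return [n for n in nodes
            if n.role == "anchor" or rng.random() >= drop_prob]


def temporal_gaps(nodes: List[FrameNode]) -> List[int]:
    """Frame-index distance between consecutive surviving nodes."""
    return [b.index - a.index for a, b in zip(nodes, nodes[1:])]


graph = build_frame_graph(num_frames=12, anchor_every=4, degraded_span=(5, 8))
kept = node_drop(graph, drop_prob=0.5, rng=random.Random(0))
print([(n.index, n.role) for n in kept])
print(temporal_gaps(kept))
```

Keeping the original frame index on each surviving node is the design point: after dropping, a model conditioned on these nodes can see how large each gap is instead of assuming adjacent frames, which is the kind of signal a position encoding could consume.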