🤖 AI Summary
Video diffusion models suffer from the high computational cost of self-attention, which hinders long-horizon video world modeling and degrades long-term memory retention and inter-frame consistency. To address this, we propose a long-horizon memory model for video world modeling. Our method integrates state space models (SSMs) through a novel block-wise temporal scanning mechanism that exploits the causal-modeling strengths of SSMs, trading some spatial consistency for extended temporal memory, and pairs it with dense local attention to preserve spatiotemporal coherence among adjacent frames. This unified architecture enables efficient, scalable video diffusion-based world modeling. Evaluated on the Memory Maze and Minecraft benchmarks, our model achieves significant improvements in spatial retrieval and reasoning accuracy at the hundred-frame scale, while maintaining real-time inference speed suitable for interactive applications. Key contributions include: (1) the first block-wise state-space scanning scheme tailored for video, and (2) the synergistic integration of SSMs, block-wise scanning, and local attention for robust long-horizon video understanding.
📝 Abstract
Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
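To make the two core ingredients concrete, the following is a toy NumPy sketch of the idea described above: a causal diagonal-SSM scan run independently over small spatial tiles across time (so temporal memory is carried in a fixed-size state per tile), followed by dense attention restricted to a short window of adjacent frames. This is not the authors' implementation; the diagonal SSM parameterization, tile size, function names, and the attention window are illustrative assumptions.

```python
import numpy as np

def ssm_scan(seq, A, B, C):
    # Causal diagonal linear SSM: h_k = A * h_{k-1} + B u_k ; y_k = C h_k.
    # A: (N,) diagonal transition, B: (N, D), C: (D, N).
    h = np.zeros(A.shape[0])
    out = np.empty((seq.shape[0], C.shape[0]))
    for k, u in enumerate(seq):
        h = A * h + B @ u
        out[k] = C @ h
    return out

def blockwise_temporal_scan(x, A, B, C, block=2):
    # x: (T, H, W, D). Split the H x W grid into block x block tiles; for each
    # tile, concatenate its tokens frame by frame into one long sequence and run
    # a single causal scan over it. Memory flows forward in time per tile;
    # cross-tile spatial mixing is deferred to the local-attention stage.
    T, H, W, D = x.shape
    y = np.zeros((T, H, W, C.shape[0]))
    for i in range(0, H, block):
        for j in range(0, W, block):
            tile = x[:, i:i + block, j:j + block, :].reshape(-1, D)
            out = ssm_scan(tile, A, B, C)
            y[:, i:i + block, j:j + block, :] = out.reshape(T, block, block, -1)
    return y

def local_temporal_attention(x, window=2):
    # Dense attention over a short temporal window: each frame's tokens attend
    # to all tokens of the last `window` frames (including the current one),
    # restoring spatial coherence among adjacent frames at bounded cost.
    T, H, W, D = x.shape
    tok = x.reshape(T, H * W, D)
    out = np.zeros_like(tok)
    for t in range(T):
        kv = tok[max(0, t - window + 1):t + 1].reshape(-1, D)
        scores = tok[t] @ kv.T / np.sqrt(D)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[t] = w @ kv
    return out.reshape(T, H, W, D)
```

Because each tile's scan is strictly causal in time, perturbing a later frame cannot change earlier outputs, while the attention stage only ever looks at a fixed number of neighboring frames, so per-step cost stays constant as the horizon grows.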