🤖 AI Summary
Diffusion-based world models suffer from poor long-range visual consistency because they condition on only a short window of recent observations, so generated scenes rapidly diverge from the historical context. To address this, we propose the first integration of state-space models, specifically Mamba, into diffusion-based world modeling, yielding a unified architecture for sequential representation learning and conditional image generation with explicit long-term memory. Our approach preserves generation fidelity while substantially improving temporal coherence. We introduce a customized evaluation protocol to quantitatively assess long-horizon consistency. Experiments in a 2D maze and a complex 3D environment demonstrate that our method maintains a coherent visual context for over an order of magnitude more rollout steps than diffusion-only baselines, effectively mitigating the long-term state forgetting inherent in conventional diffusion world models.
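To make the conditioning pattern concrete, here is a minimal, self-contained sketch of the idea, not the paper's implementation: a simple linear recurrence stands in for Mamba, the module names (`HistoryEncoder`, `ConditionalDenoiser`), dimensions, and the linear noise schedule are all illustrative assumptions.

```python
# Sketch of the StateSpaceDiffuser idea: a sequence model (standing in for
# Mamba) compresses the full (observation, action) history into a fixed-size
# state, which conditions the diffusion denoiser at every step.
# All names and sizes below are hypothetical, for illustration only.
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Stand-in for a Mamba block: summarizes the whole interaction
    history via a learned linear recurrence over time."""
    def __init__(self, obs_dim: int, act_dim: int, state_dim: int):
        super().__init__()
        self.inp = nn.Linear(obs_dim + act_dim, state_dim)
        self.A = nn.Linear(state_dim, state_dim, bias=False)  # state transition
        self.B = nn.Linear(state_dim, state_dim, bias=False)  # input mixing

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (B, T, obs_dim), act: (B, T, act_dim)
        u = self.inp(torch.cat([obs, act], dim=-1))
        h = torch.zeros(obs.size(0), u.size(-1), device=obs.device)
        for t in range(u.size(1)):  # recurrence over the full history
            h = torch.tanh(self.A(h) + self.B(u[:, t]))
        return h  # (B, state_dim): long-term memory summary

class ConditionalDenoiser(nn.Module):
    """Toy noise-predictor conditioned on diffusion time and history state."""
    def __init__(self, img_dim: int, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + state_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, img_dim),
        )

    def forward(self, x_noisy, t, history_state):
        # x_noisy: (B, img_dim); t: (B,) in [0, 1]; history_state: (B, state_dim)
        return self.net(torch.cat([x_noisy, history_state, t[:, None]], dim=-1))

# One denoising training step on random data (flattened 32x32 RGB frames).
B, T, obs_dim, act_dim, state_dim = 8, 100, 3 * 32 * 32, 4, 128
encoder = HistoryEncoder(obs_dim, act_dim, state_dim)
denoiser = ConditionalDenoiser(obs_dim, state_dim)

obs, act = torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim)
target = torch.randn(B, obs_dim)              # next frame to predict
h = encoder(obs, act)                         # whole-history summary
t = torch.rand(B)
noise = torch.randn_like(target)
x_noisy = (1 - t[:, None]) * target + t[:, None] * noise  # toy linear schedule
loss = ((denoiser(x_noisy, t, h) - noise) ** 2).mean()    # noise-matching loss
loss.backward()
print(f"denoising loss: {loss.item():.4f}")
```

The key design point this sketch illustrates is that the history state is recomputed from the entire trajectory, not just the last few frames, so the denoiser's conditioning signal never "forgets" early observations.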
📝 Abstract
World models have recently emerged as promising tools for predicting realistic visuals conditioned on actions in complex environments. However, because they rely on only a short sequence of recent observations, they quickly lose track of context: visual consistency breaks down after just a few steps, and generated scenes no longer reflect information seen earlier. This limitation of state-of-the-art diffusion-based world models stems from their lack of a persistent environment state. To address this problem, we introduce StateSpaceDiffuser, which enables a diffusion model to perform long-context tasks by conditioning it on a sequence representation from a state-space model (Mamba) that encodes the entire interaction history. This design restores long-term memory without sacrificing the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate previously seen content during extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation task and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is an effective way to obtain both rich visual detail and long-term memory.
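The abstract does not spell out the evaluation protocol, so the following is only a plausible sketch of a long-horizon consistency probe in its spirit: compare each rollout frame against the ground-truth frame and count how many consecutive steps stay above a fidelity threshold. The use of PSNR, the 20 dB cutoff, and the `coherent_horizon` helper are all illustrative assumptions, not the paper's metric.

```python
# Hypothetical long-horizon consistency probe: count the consecutive rollout
# steps whose frames remain faithful to the ground truth. PSNR and the 20 dB
# threshold are illustrative choices, not taken from the paper.
import torch

def psnr(pred: torch.Tensor, ref: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio per frame; higher means more faithful."""
    mse = ((pred - ref) ** 2).flatten(1).mean(dim=1).clamp_min(1e-10)
    return 10.0 * torch.log10(max_val**2 / mse)

def coherent_horizon(rollout: torch.Tensor, ground_truth: torch.Tensor,
                     threshold_db: float = 20.0) -> int:
    """Number of consecutive steps, from the start of the rollout, whose
    frames stay above the fidelity threshold w.r.t. the true frames.
    rollout, ground_truth: (T, C, H, W) tensors with values in [0, 1]."""
    scores = psnr(rollout, ground_truth)
    below = (scores < threshold_db).nonzero()
    return int(below[0]) if below.numel() else rollout.size(0)

# Example: a 200-step rollout whose frames degrade gradually; the probe
# prints the first step at which PSNR falls below the threshold.
T = 200
gt = torch.rand(T, 3, 32, 32)
noise_level = torch.linspace(0.0, 0.15, T).view(T, 1, 1, 1)
pred = (gt + noise_level * torch.randn_like(gt)).clamp(0, 1)
print("coherent steps:", coherent_horizon(pred, gt))
```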