🤖 AI Summary
Diffusion-based world models suffer from poor long-range visual consistency because they condition on only a short window of recent observations, so generated scenes rapidly diverge from the historical context. To address this, we propose the first integration of state-space models, specifically Mamba, into diffusion-based world modeling, yielding a unified architecture for sequential representation learning and conditional image generation with explicit long-term memory. Our approach preserves generation fidelity while substantially improving temporal coherence. We introduce a customized evaluation protocol to quantitatively assess long-horizon consistency. Experiments in a 2D maze and a complex 3D environment demonstrate that our method maintains a coherent visual context for over an order of magnitude more rollout steps than diffusion-only baselines, effectively mitigating the long-term state forgetting inherent in conventional diffusion world models.
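To make the conditioning pattern concrete, here is a minimal, self-contained sketch of the idea, not the paper's implementation: a simple linear recurrence stands in for Mamba, the module names (`HistoryEncoder`, `ConditionalDenoiser`), dimensions, and the linear noise schedule are all illustrative assumptions.

```python
# Sketch of the StateSpaceDiffuser idea: a sequence model (standing in for
# Mamba) compresses the full (observation, action) history into a fixed-size
# state, which conditions the diffusion denoiser at every step.
# All names and sizes below are hypothetical, for illustration only.
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Stand-in for a Mamba block: summarizes the whole interaction
    history via a learned linear recurrence over time."""
    def __init__(self, obs_dim: int, act_dim: int, state_dim: int):
        super().__init__()
        self.inp = nn.Linear(obs_dim + act_dim, state_dim)
        self.A = nn.Linear(state_dim, state_dim, bias=False)  # state transition
        self.B = nn.Linear(state_dim, state_dim, bias=False)  # input mixing

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (B, T, obs_dim), act: (B, T, act_dim)
        u = self.inp(torch.cat([obs, act], dim=-1))
        h = torch.zeros(obs.size(0), u.size(-1), device=obs.device)
        for t in range(u.size(1)):  # recurrence over the full history
            h = torch.tanh(self.A(h) + self.B(u[:, t]))
        return h  # (B, state_dim): long-term memory summary

class ConditionalDenoiser(nn.Module):
    """Toy noise-predictor conditioned on diffusion time and history state."""
    def __init__(self, img_dim: int, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + state_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, img_dim),
        )

    def forward(self, x_noisy, t, history_state):
        # x_noisy: (B, img_dim); t: (B,) in [0, 1]; history_state: (B, state_dim)
        return self.net(torch.cat([x_noisy, history_state, t[:, None]], dim=-1))

# One denoising training step on random data (flattened 32x32 RGB frames).
B, T, obs_dim, act_dim, state_dim = 8, 100, 3 * 32 * 32, 4, 128
encoder = HistoryEncoder(obs_dim, act_dim, state_dim)
denoiser = ConditionalDenoiser(obs_dim, state_dim)

obs, act = torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim)
target = torch.randn(B, obs_dim)              # next frame to predict
h = encoder(obs, act)                         # whole-history summary
t = torch.rand(B)
noise = torch.randn_like(target)
x_noisy = (1 - t[:, None]) * target + t[:, None] * noise  # toy linear schedule
loss = ((denoiser(x_noisy, t, h) - noise) ** 2).mean()    # noise-matching loss
loss.backward()
print(f"denoising loss: {loss.item():.4f}")
```

The key design point this sketch illustrates is that the history state is recomputed from the entire trajectory, not just the last few frames, so the denoiser's conditioning signal never "forgets" early observations.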
📝 Abstract
World models have recently emerged as promising tools for predicting realistic visuals conditioned on actions in complex environments. However, because they rely on only a short sequence of recent observations, they quickly lose track of context: visual consistency breaks down after just a few steps, and generated scenes no longer reflect information seen earlier. This limitation of state-of-the-art diffusion-based world models stems from their lack of a persistent environment state. To address this problem, we introduce StateSpaceDiffuser, which enables a diffusion model to perform long-context tasks by conditioning it on a sequence representation from a state-space model (Mamba) that encodes the entire interaction history. This design restores long-term memory without sacrificing the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate previously seen content during extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation task and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is an effective way to obtain both rich visual detail and long-term memory.
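The abstract does not spell out the evaluation protocol, so the following is only a plausible sketch of a long-horizon consistency probe in its spirit: compare each rollout frame against the ground-truth frame and count how many consecutive steps stay above a fidelity threshold. The use of PSNR, the 20 dB cutoff, and the `coherent_horizon` helper are all illustrative assumptions, not the paper's metric.

```python
# Hypothetical long-horizon consistency probe: count the consecutive rollout
# steps whose frames remain faithful to the ground truth. PSNR and the 20 dB
# threshold are illustrative choices, not taken from the paper.
import torch

def psnr(pred: torch.Tensor, ref: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio per frame; higher means more faithful."""
    mse = ((pred - ref) ** 2).flatten(1).mean(dim=1).clamp_min(1e-10)
    return 10.0 * torch.log10(max_val**2 / mse)

def coherent_horizon(rollout: torch.Tensor, ground_truth: torch.Tensor,
                     threshold_db: float = 20.0) -> int:
    """Number of consecutive steps, from the start of the rollout, whose
    frames stay above the fidelity threshold w.r.t. the true frames.
    rollout, ground_truth: (T, C, H, W) tensors with values in [0, 1]."""
    scores = psnr(rollout, ground_truth)
    below = (scores < threshold_db).nonzero()
    return int(below[0]) if below.numel() else rollout.size(0)

# Example: a 200-step rollout whose frames degrade gradually; the probe
# prints the first step at which PSNR falls below the threshold.
T = 200
gt = torch.rand(T, 3, 32, 32)
noise_level = torch.linspace(0.0, 0.15, T).view(T, 1, 1, 1)
pred = (gt + noise_level * torch.randn_like(gt)).clamp(0, 1)
print("coherent steps:", coherent_horizon(pred, gt))
```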