🤖 AI Summary
To address the degradation of minute-scale temporal coherence in autoregressive long-video generation—caused by error accumulation, motion drift, and content repetition—this paper proposes VideoSSM, an autoregressive diffusion model built around a hybrid state-space memory. The design integrates a state-space model (SSM) with a sliding local context window to form a dynamic memory mechanism that jointly ensures global consistency and fine-grained fidelity. It unifies the modeling of short- and long-range dependencies, enables efficient linear-complexity generation over extended sequences, and supports streaming interaction and prompt-based control. Embedded within an autoregressive diffusion framework, the hybrid memory significantly improves temporal consistency and motion stability, achieving state-of-the-art performance on both short-horizon and long-horizon video generation benchmarks. By enabling scalable, controllable, minute-long video synthesis, VideoSSM establishes an extensible, memory-aware paradigm for long-duration generative video modeling.
📝 Abstract
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales linearly with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, while enabling content diversity and interactive prompt-based control—establishing a scalable, memory-aware framework for long video generation.
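The hybrid memory idea described above—a recurrent global state that evolves linearly over the whole sequence, paired with a sliding window of recent frames—can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the class name, matrix shapes, and the fixed decay factor are all assumptions chosen to show the linear-time recurrence plus bounded local context.

```python
from collections import deque
import numpy as np

class HybridStateSpaceMemory:
    """Illustrative sketch (hypothetical, not the paper's code): a global
    SSM-style recurrent state plus a sliding local window of recent frames."""

    def __init__(self, dim, window=4, decay=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.A = decay * np.eye(dim)                              # stable state transition
        self.B = rng.standard_normal((dim, dim)) / np.sqrt(dim)   # input projection
        self.state = np.zeros(dim)                                # global memory h_t
        self.window = deque(maxlen=window)                        # local memory (fixed size)

    def update(self, frame_feat):
        # SSM recurrence: h_t = A h_{t-1} + B x_t — O(1) per frame,
        # so total cost is linear in sequence length.
        self.state = self.A @ self.state + self.B @ frame_feat
        self.window.append(frame_feat)

    def context(self):
        # The generator would condition the next frame on both memories:
        # the global state (long-range dynamics) and the local window (fine detail).
        return self.state, list(self.window)

# Usage: stream 10 synthetic frame features through the memory.
dim = 8
mem = HybridStateSpaceMemory(dim, window=3)
for t in range(10):
    mem.update(np.full(dim, float(t)))
global_state, local_frames = mem.context()
```

Note the design trade-off the sketch makes visible: the window caps the cost of local attention at a constant, while the recurrent state carries unbounded history forward at fixed memory—together matching the linear-scaling claim in the abstract.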