🤖 AI Summary
To address the degradation of minute-scale temporal coherence in autoregressive long-video generation—caused by error accumulation, motion drift, and content repetition—this paper proposes VideoSSM, an autoregressive diffusion model built around a hybrid state-space memory. The design integrates a state-space model (SSM) with a sliding local context window to form a dynamic memory mechanism that jointly ensures global consistency and fine-grained fidelity. It unifies the modeling of short- and long-range dependencies, enables efficient linear-complexity generation over extended sequences, and supports streaming interaction and prompt-based control. Embedded within an autoregressive diffusion framework, the hybrid memory significantly improves temporal consistency and motion stability, achieving state-of-the-art performance on both short-horizon and long-horizon video generation benchmarks. By enabling scalable, controllable, minute-long video synthesis, VideoSSM establishes an extensible, memory-aware paradigm for long-duration generative video modeling.
📝 Abstract
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales linearly with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, while enabling content diversity and interactive prompt-based control—establishing a scalable, memory-aware framework for long video generation.
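The hybrid memory idea described above—a recurrent global state that evolves linearly over the whole sequence, paired with a sliding window of recent frames—can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the class name, matrix shapes, and the fixed decay factor are all assumptions chosen to show the linear-time recurrence plus bounded local context.

```python
from collections import deque
import numpy as np

class HybridStateSpaceMemory:
    """Illustrative sketch (hypothetical, not the paper's code): a global
    SSM-style recurrent state plus a sliding local window of recent frames."""

    def __init__(self, dim, window=4, decay=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.A = decay * np.eye(dim)                              # stable state transition
        self.B = rng.standard_normal((dim, dim)) / np.sqrt(dim)   # input projection
        self.state = np.zeros(dim)                                # global memory h_t
        self.window = deque(maxlen=window)                        # local memory (fixed size)

    def update(self, frame_feat):
        # SSM recurrence: h_t = A h_{t-1} + B x_t — O(1) per frame,
        # so total cost is linear in sequence length.
        self.state = self.A @ self.state + self.B @ frame_feat
        self.window.append(frame_feat)

    def context(self):
        # The generator would condition the next frame on both memories:
        # the global state (long-range dynamics) and the local window (fine detail).
        return self.state, list(self.window)

# Usage: stream 10 synthetic frame features through the memory.
dim = 8
mem = HybridStateSpaceMemory(dim, window=3)
for t in range(10):
    mem.update(np.full(dim, float(t)))
global_state, local_frames = mem.context()
```

Note the design trade-off the sketch makes visible: the window caps the cost of local attention at a constant, while the recurrent state carries unbounded history forward at fixed memory—together matching the linear-scaling claim in the abstract.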