Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing autoregressive streaming video generation methods suffer from severe error accumulation during iterative frame-by-frame synthesis, hindering real-time generation of high-quality minute-long videos. This paper proposes Rolling Forcing, a diffusion-based framework for long-horizon streaming video generation built on three core innovations: (1) a joint denoising scheme that samples multiple frames in parallel at progressively increasing noise levels, mitigating per-step error amplification; (2) an attention sink mechanism that anchors the key-value states of initial frames as global context, enhancing long-term spatiotemporal coherence; and (3) a few-step distillation training strategy over non-overlapping windows that balances inference efficiency and stability. The method achieves low-latency, highly coherent streaming generation of minute-scale videos on a single GPU, substantially suppressing error propagation while delivering superior visual quality and temporal consistency compared to state-of-the-art approaches.

📝 Abstract
Streaming video generation, a fundamental component of interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that significantly degrades the generated video streams over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon streaming video generation task, which allows the model to retain the key-value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias by conditioning on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
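The joint denoising scheme described above can be sketched as a rolling window in which each frame sits at a different noise level. This is an illustrative sketch only, not the paper's implementation: `denoise_step`, the window size, and the frame shape are all assumptions, and a real model would run a learned video diffusion network rather than an arbitrary callable.

```python
import numpy as np

def rolling_forcing_sketch(denoise_step, num_frames, window=4, frame_shape=(8, 8)):
    """Illustrative rolling-window denoising loop (hypothetical interface).

    A window of `window` frames is denoised jointly; frame i in the window
    sits at noise level (i+1)/window, so the oldest frame is nearly clean
    and the newest is nearly pure noise. Each step jointly denoises all
    frames by one level, emits the now-clean oldest frame, and appends a
    fresh pure-noise frame at the back of the window.

    `denoise_step(frames, levels) -> frames` is an assumed stand-in for one
    step of a learned few-step diffusion denoiser.
    """
    rng = np.random.default_rng(0)
    # Initialize the window with pure-noise frames.
    window_frames = [rng.standard_normal(frame_shape) for _ in range(window)]
    # Progressively increasing noise levels across the window.
    levels = np.arange(1, window + 1) / window
    output = []
    for _ in range(num_frames):
        window_frames = list(denoise_step(window_frames, levels))
        output.append(window_frames.pop(0))  # oldest frame is clean: stream it out
        window_frames.append(rng.standard_normal(frame_shape))  # new noise frame enters
    return output
```

Because frames leave the window one per step while a new noise frame enters, the stream is produced continuously rather than in fixed chunks, which is what relaxes the strict frame-by-frame causality.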
Problem

Research questions and friction points this paper is trying to address.

Reduces error accumulation in long video generation
Enhances long-term global consistency in streaming
Enables real-time multi-minute video generation efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint denoising scheme with progressive noise levels
Attention sink mechanism for global context anchoring
Few-step distillation algorithm for extended windows
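The attention sink idea in the bullets above can be sketched as a KV cache that pins the earliest frames while recent frames roll through a bounded window. This is a minimal sketch under assumed names (`SinkKVCache`, `sink`, `window`); the actual model stores transformer key-value tensors, not opaque entries.

```python
from collections import deque

class SinkKVCache:
    """Illustrative attention-sink cache (hypothetical, not the paper's code).

    The key-value states of the first `sink` frames are pinned as a global
    context anchor and never evicted; later frames live in a bounded rolling
    window that silently drops its oldest entry when full.
    """

    def __init__(self, sink=2, window=4):
        self.sink = []                       # KV states of initial frames, never evicted
        self.sink_size = sink
        self.recent = deque(maxlen=window)   # rolling window of recent KV states

    def append(self, kv):
        if len(self.sink) < self.sink_size:
            self.sink.append(kv)             # anchor the earliest frames
        else:
            self.recent.append(kv)           # deque evicts the oldest automatically

    def context(self):
        # Attention at each step sees the pinned anchor plus the recent window.
        return self.sink + list(self.recent)
```

Keeping the initial frames' states in every attention context is what lets arbitrarily late frames stay globally consistent with the start of the video, at constant memory cost.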