LoL: Longer than Longer, Scaling Video Generation to Hour

📅 2026-01-23
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the "sink-collapse" phenomenon in autoregressive long-video generation, where attention mechanisms rely excessively on designated sink frames, producing abrupt scene resets and repetitive motion patterns. To mitigate this without retraining, the authors propose a lightweight intervention: injecting random perturbations into the rotary positional embeddings (RoPE) of the multi-head attention layers. This simple modification breaks the homogeneity across attention heads and thereby suppresses sink-collapse, enabling real-time, streaming video generation of unbounded length while maintaining high visual fidelity. In public demonstrations, the method produced a continuous 12-hour video, among the longest streaming-generation results reported to date, with minimal quality degradation over time.

📝 Abstract
Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
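The multi-head RoPE jitter described in the abstract can be sketched in a few lines of NumPy. The idea is that every attention head normally sees identical rotary angles, which homogenizes the heads' attention patterns; adding a small random positional offset per head decorrelates them. This is a minimal illustration under assumptions: the per-head Gaussian offset, the `sigma` value, and the function names are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE: one rotation angle per (position, frequency) pair.
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    return positions[:, None] * freqs[None, :]         # (seq, dim/2)

def apply_rope(x, angles):
    # Rotate consecutive feature pairs of x by the given angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def jittered_rope(x, positions, sigma=0.5, rng=None):
    """Apply RoPE with a small random positional offset per head.

    x: (heads, seq, dim). Head h sees positions + eps_h with
    eps_h ~ N(0, sigma^2), so heads no longer share identical
    rotary angles (hypothetical sketch of multi-head RoPE jitter).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    heads, _, dim = x.shape
    out = np.empty_like(x)
    for h in range(heads):
        jitter = rng.normal(0.0, sigma)                # one offset per head
        out[h] = apply_rope(x[h], rope_angles(positions + jitter, dim))
    return out
```

Because the jitter only shifts the rotation angles, each head's queries and keys keep their norms (RoPE is a pure rotation), so the intervention is training-free and does not rescale attention logits.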
Problem

Research questions and friction points this paper is trying to address.

long-form video generation
autoregressive models
sink-collapse
long-term coherence
error accumulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

sink-collapse
RoPE jitter
long-form video generation
autoregressive models
attention sink