DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of bounded-memory autoregressive long video generation: static early frames serve as fixed long-range anchors, so the retained context cannot adapt when the current visual state diverges from them. This can bias generation toward outdated cues and, in severe cases, cause sink collapse, where content regresses toward the sink frames. The authors propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant past frames as dynamic frame sinks. A sink anomaly gate detects excessive inter-head attention consensus over the retrieved context and suppresses collapse-prone context. On minute-long video generation, DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality.
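The retrieval idea in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each stored frame is represented by a single feature vector and that "visually relevant" means high cosine similarity to the current frame. The names `cosine` and `select_dynamic_sinks`, and the parameter `k`, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero


def select_dynamic_sinks(memory_bank, query, k=2):
    """Return indices of the k stored frames most similar to the query.

    Assumption: similarity is plain cosine similarity on per-frame features.
    The paper's actual retrieval criterion is not specified here.
    """
    scores = [cosine(frame, query) for frame in memory_bank]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the selected frames in temporal (index) order.
    return sorted(ranked[:k])


# Toy memory bank of four past-frame features; the query is the current frame.
bank = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
sinks = select_dynamic_sinks(bank, [1.0, 0.0], k=2)
print(sinks)
```

Unlike a static sink, which would always keep the first frames, this selection changes whenever the query feature changes, so an intermediate frame can replace an early one once it becomes more relevant.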

📝 Abstract

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
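The sink anomaly gate described in the abstract can be sketched in the same hedged spirit. The abstract does not say how inter-head consensus is measured, so this sketch assumes it is the mean pairwise cosine similarity between the heads' attention distributions over the retrieved sinks, compared against a fixed threshold. The function names and the `threshold` value are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero


def head_consensus(attn_per_head):
    """Mean pairwise cosine similarity of per-head attention over sink tokens."""
    pairs = list(combinations(attn_per_head, 2))
    if not pairs:
        return 0.0  # a single head has no inter-head consensus to measure
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def keep_retrieved_sinks(attn_per_head, threshold=0.95):
    """Return True to keep the retrieved context, False to suppress it.

    Assumption: consensus above `threshold` signals collapse-prone context.
    The measure and the threshold are illustrative, not taken from the paper.
    """
    return head_consensus(attn_per_head) <= threshold


# Heads attending to different sink tokens: context is kept.
diverse = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2]]
# Heads with identical attention patterns: context is suppressed.
homogeneous = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]
print(keep_retrieved_sinks(diverse), keep_retrieved_sinks(homogeneous))
```

A hard keep-or-drop decision is the simplest possible gate; a soft variant that scales down attention to the sinks would fit the same description equally well.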

Problem

Research questions and friction points this paper is trying to address.

autoregressive video generation

long-range memory

frame sinks

sink collapse

temporal consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Frame Sinks

Autoregressive Video Generation

Memory Retrieval