🤖 AI Summary
Autoregressive long video generation often degrades in quality over time because errors accumulate, and existing key-value (KV) caching strategies retain the same historical frames for every attention head, disregarding the heads' heterogeneous temporal dependencies. This work shows that attention heads fall into three distinct types (Anchor, Wave, and Veil) and introduces Pyramid Forcing, a head-aware pyramidal KV caching framework that assigns each head type its own cache policy and cache length. The approach combines offline head classification with an efficient ragged-cache attention mechanism. On VBench-Long, it raises the 60-second Self Forcing score from 77.87 to 81.21 while improving motion dynamics, visual fidelity, and semantic consistency.
📝 Abstract
Autoregressive video generation enables streaming and open-ended long video synthesis, but still suffers from long-term degradation caused by accumulated errors. Existing KVCache strategies usually apply unified historical-frame retention, implicitly assuming homogeneous historical dependencies across attention heads. We revisit historical-frame attention and reveal three distinct head types: Anchor Heads require broad long-range context, Wave Heads exhibit periodic temporal dependencies, and Veil Heads focus on initial and adjacent frames. Based on this finding, we propose Pyramid Forcing, a head-aware pyramidal KVCache framework that identifies head types offline, assigns behavior-specific cache policies, and supports heterogeneous cache lengths via efficient ragged-cache attention. Experiments on Self Forcing and Causal Forcing show that Pyramid Forcing consistently improves long-horizon generation quality on VBench-Long, increasing the 60-second Self Forcing score from 77.87 to 81.21 while enhancing motion dynamics, visual fidelity, and semantic consistency. Project: https://if-lab-pku.github.io/Pyramid-Forcing/.
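To make the idea of behavior-specific cache policies concrete, the following is a minimal sketch of how each head type might choose which past frames to keep, so that different heads end up with caches of different lengths. It is not the authors' implementation: the function names `retained_frames` and `build_ragged_cache`, the parameters `sink`, `recent`, `long_window`, and `period`, and all default values are hypothetical choices made only for illustration.

```python
from typing import Dict, List


def retained_frames(head_type: str, num_past: int, *, sink: int = 1,
                    recent: int = 3, long_window: int = 12,
                    period: int = 4) -> List[int]:
    """Return sorted indices of the past frames (0..num_past-1) a head keeps.

    All window sizes and the period are illustrative assumptions,
    not values taken from the paper.
    """
    if num_past <= 0:
        return []
    first = set(range(min(sink, num_past)))  # initial frame(s)
    if head_type == "anchor":
        # Assumed policy for broad long-range context: a long trailing window.
        keep = first | set(range(max(0, num_past - long_window), num_past))
    elif head_type == "wave":
        # Assumed policy for periodic dependencies: a fixed stride back
        # from the newest frame.
        keep = set(range(num_past - 1, -1, -period))
    elif head_type == "veil":
        # Assumed policy for initial and adjacent frames: sink plus a short
        # recent window.
        keep = first | set(range(max(0, num_past - recent), num_past))
    else:
        raise ValueError(f"unknown head type: {head_type}")
    return sorted(keep)


def build_ragged_cache(head_types: List[str], num_past: int) -> Dict[int, List[int]]:
    """Map each head index to its own frame list; lengths may differ per head."""
    return {h: retained_frames(t, num_past) for h, t in enumerate(head_types)}


cache = build_ragged_cache(["anchor", "wave", "veil"], num_past=20)
for head, frames in cache.items():
    print(head, len(frames), frames)
```

Because the per-head frame lists have different lengths, an attention kernel consuming them would need to handle ragged inputs rather than a single dense cache tensor, which is the role the abstract assigns to ragged-cache attention.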