🤖 AI Summary
This work addresses long-horizon quality degradation, high decoding latency, and large KV-cache memory usage in autoregressive video diffusion models. The authors propose Sparse Forcing, a training-and-inference paradigm built on a trainable native sparse attention mechanism. It exploits the observation that attention concentrates on a persistent subset of salient visual blocks: the mechanism learns to compress, preserve, and update those blocks, and restricts computation within each local window to a dynamically selected neighborhood. A custom GPU kernel, Persistent Block-Sparse Attention (PBSA), makes the approach practical for large-scale training and inference. On text-to-video generation, VBench scores improve by 0.26 over Self-Forcing on 5-second videos, and by 0.68 and 2.74 on 20-second and 1-minute videos, respectively. Decoding is 1.11–1.17× faster at 5 seconds, 1.22× at 20 seconds, and 1.27× at 1 minute, and the peak KV-cache footprint is 42% lower in the 5-second setting.
📝 Abstract
We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and memory updates for low-latency, memory-efficient decoding. Experiments show that Sparse Forcing improves the VBench score by +0.26 over Self-Forcing on 5-second text-to-video generation while delivering a 1.11–1.17× decoding speedup and a 42% lower peak KV-cache footprint. The gains are more pronounced on longer-horizon rollouts: VBench improves by +0.68 and +2.74, with 1.22× and 1.27× speedups, on 20-second and 1-minute generations, respectively.
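The attention pattern described in the abstract (a few persistent blocks plus a dynamically selected neighborhood inside a sliding window) can be illustrated with a toy NumPy sketch. This is not the authors' method or the PBSA kernel: the function name `block_sparse_attention`, the mean-pooled block scoring, the fixed `persistent` indices, and all parameter values are assumptions made for illustration only, and nothing here is learned or compressed.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4, persistent=(0,), window=3, top_k=2):
    """Toy single-head, block-causal sparse attention (illustrative only).

    q, k, v: float arrays of shape (n_tokens, dim); n_tokens is a multiple of `block`.
    persistent: KV block indices that every later query block always reads
                (fixed by hand here; the paper learns which blocks to keep).
    window: number of most recent KV blocks, including the current one, treated as local.
    top_k: number of local blocks kept per query block.
    Returns (output, mask), where mask[i, j] is True if query block i reads KV block j.
    """
    n, d = q.shape
    assert n % block == 0, "n_tokens must be a multiple of block"
    nb = n // block

    # One summary vector per block via mean pooling (an assumed scoring rule).
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    scores = qb @ kb.T  # (nb, nb) block-level relevance

    mask = np.zeros((nb, nb), dtype=bool)
    for i in range(nb):
        lo = max(0, i - window + 1)
        local = np.arange(lo, i + 1)  # causal sliding window of blocks
        best = local[np.argsort(scores[i, local])[::-1][:top_k]]
        mask[i, best] = True          # dynamically selected neighborhood
        mask[i, i] = True             # the current block is always visible
        for p in persistent:
            if p <= i:                # keep block-level causality
                mask[i, p] = True

    # Expand the block mask to token level and apply a masked softmax.
    token_mask = np.repeat(np.repeat(mask, block, axis=0), block, axis=1)
    logits = (q @ k.T) / np.sqrt(d)
    logits = np.where(token_mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, mask

# Usage: 6 blocks of 4 tokens each, 8-dimensional features.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((24, 8)) for _ in range(3))
out, mask = block_sparse_attention(q, k, v)
print(mask.astype(int))  # rows: query blocks, columns: KV blocks read
```

Each query block reads at most `top_k + 1 + len(persistent)` KV blocks regardless of sequence length, which is the property that lets a block-sparse scheme bound per-step compute and cache reads. A real kernel would skip the masked blocks entirely instead of computing dense logits and masking them, as this sketch does for clarity.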