🤖 AI Summary
Generating high-quality, coherent minute-long videos remains a critical challenge for world model construction. Existing approaches suffer from long-range error accumulation caused by KV caching, and the field lacks fine-grained long-video evaluation benchmarks and consistency metrics. This paper introduces BlockVid, a framework that addresses these limitations with: (1) a semantic-aware sparse KV cache and a Block Forcing training strategy to suppress error propagation; (2) block-wise noise scheduling and data shuffling that combine diffusion and autoregressive modeling, enabling efficient arbitrary-length generation; and (3) LV-Bench, the first fine-grained long-video benchmark, along with a quantitative long-range consistency metric. Experiments on VBench and LV-Bench show that BlockVid outperforms prior methods, with improvements of 22.2% on VDE Subject and 19.4% on VDE Clarity on LV-Bench, yielding clearer and more temporally consistent minute-long videos.
📝 Abstract
Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two persistent challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity on LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.
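To make the semi-autoregressive idea concrete, the following is a minimal toy sketch, not BlockVid's implementation and not the Inferix API. It assumes that blocks are generated left to right, that each block is produced by a few denoising steps, and that a sparse cache keeps only the `top_k` earlier blocks most similar to the current prompt. The cosine-similarity selection rule, the averaging "denoiser", and all names (`select_context`, `denoise_block`, `generate`) are hypothetical stand-ins chosen for illustration; the abstract does not specify how the semantic-aware selection is actually done.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_context(cache, query, top_k):
    """Sparse cache (assumed rule): keep the top_k entries most similar to the query."""
    ranked = sorted(cache, key=lambda e: cosine(e["key"], query), reverse=True)
    return ranked[:top_k]


def denoise_block(context, prompt, dim, steps, rng):
    """Toy stand-in for a diffusion denoiser: start from Gaussian noise and
    move toward the mean of the prompt and the selected context values."""
    targets = [prompt] + [e["value"] for e in context]
    target = [sum(v[i] for v in targets) / len(targets) for i in range(dim)]
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for step in range(steps):
        alpha = (step + 1) / steps  # interpolation weight, reaches 1 at the last step
        x = [(1 - alpha) * xi + alpha * ti for xi, ti in zip(x, target)]
    return x


def generate(prompts, dim=4, steps=8, top_k=2, seed=0):
    """Generate one block per prompt, left to right, conditioning on a sparse cache."""
    rng = random.Random(seed)
    cache, blocks = [], []
    for prompt in prompts:
        context = select_context(cache, prompt, top_k)
        block = denoise_block(context, prompt, dim, steps, rng)
        blocks.append(block)
        cache.append({"key": prompt, "value": block})  # finished block joins the cache
    return blocks


if __name__ == "__main__":
    video = generate([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
    print(len(video), "blocks generated")
```

The point of the sketch is the control flow: each new block attends only to a small, relevance-ranked subset of earlier blocks rather than the full history, which is one plausible way a sparse cache could limit how much accumulated error is carried forward.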