BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Generating high-quality, coherent minute-long videos remains a critical challenge for building world models. Existing block diffusion approaches suffer from long-horizon error accumulation induced by KV caching, and the field lacks fine-grained long-video benchmarks and coherence-aware metrics. This paper introduces BlockVid, a block diffusion framework that addresses these limitations with: (1) a semantic-aware sparse KV cache and a Block Forcing training strategy to suppress error propagation; (2) chunk-wise noise scheduling and shuffling to further reduce error propagation and improve temporal consistency; and (3) LV-Bench, a fine-grained benchmark for minute-long videos, together with new metrics for long-range coherence. Experiments show that BlockVid outperforms prior methods on VBench and LV-Bench, improving VDE Subject by 22.2% and VDE Clarity by 19.4% on LV-Bench over state-of-the-art approaches.

📝 Abstract

Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.
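To make the semi-autoregressive control flow concrete, here is a toy sketch of how block-by-block generation with a cache might be organized. It is not BlockVid's implementation: `denoise_step` and `generate_video` are hypothetical names, frames are reduced to single floats, and the "denoiser" is a simple arithmetic stand-in for a learned model.

```python
import random

def denoise_step(block, context, remaining_steps):
    """Hypothetical stand-in for one call to a learned denoiser.

    Moves every frame of the block toward a target derived from the
    cached context (here: the last cached frame, or 0.0 if empty).
    """
    target = context[-1] if context else 0.0
    return [x + (target - x) / remaining_steps for x in block]

def generate_video(num_blocks, block_size, steps=4, seed=0):
    rng = random.Random(seed)
    cache = []  # stands in for the KV cache of already generated blocks
    video = []
    for _ in range(num_blocks):  # autoregressive part: one block after another
        # Each block starts from pure noise.
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        for step in range(steps):
            # Diffusion part: all frames of the block are refined together.
            block = denoise_step(block, cache, steps - step)
        cache.extend(block)  # finished blocks become context for later ones
        video.extend(block)
    return video

frames = generate_video(num_blocks=3, block_size=4)
print(len(frames))
```

Because every later block is conditioned on whatever is stored in `cache`, any error in an early block is carried forward, which is the long-horizon accumulation problem the abstract describes.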

Problem

Research questions and friction points this paper is trying to address.

Addresses error accumulation in long video generation

Overcomes lack of fine-grained long-video benchmarks

Improves temporal consistency in minute-long videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-aware sparse KV cache reduces error propagation

Block Forcing training strategy enhances temporal consistency

Chunk-wise noise scheduling and shuffling improve video coherence
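As a rough illustration of what a semantic-aware sparse cache could look like, the sketch below keeps only the past blocks whose embeddings are most similar to the current block's query, plus the most recent block. The selection rule, the function names (`cosine`, `select_cache_blocks`), and the parameters `top_k` and `keep_recent` are assumptions made for illustration; the paper's actual selection criterion is not specified on this page.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_cache_blocks(query, block_embeddings, top_k=2, keep_recent=1):
    """Return sorted indices of past blocks to keep in the cache.

    Assumed rule (hypothetical): always keep the `keep_recent` newest
    blocks, then add the `top_k` older blocks most similar to `query`.
    """
    n = len(block_embeddings)
    recent = set(range(max(0, n - keep_recent), n))
    older = [i for i in range(n) if i not in recent]
    ranked = sorted(
        older,
        key=lambda i: cosine(query, block_embeddings[i]),
        reverse=True,
    )
    return sorted(recent | set(ranked[:top_k]))

# Four past blocks with 2-D toy embeddings; the query resembles blocks 0 and 2.
past = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
print(select_cache_blocks([1.0, 0.0], past))  # [0, 2, 3]
```

Attending to a small, relevant subset rather than the full history is one plausible way to limit how much stale or erroneous context reaches later blocks.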

🔎 Similar Papers

Pyramidal Flow Matching for Efficient Video Generative Modeling