BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

📅 2025-11-28
🤖 AI Summary
Generating high-quality, minute-long coherent videos remains a critical challenge for world model construction. Existing approaches suffer from long-range error accumulation due to KV caching and lack fine-grained long-video evaluation benchmarks and consistency metrics. This paper introduces BlockVid, a novel framework addressing these limitations: (1) a semantic-aware sparse KV cache and Block Forcing training strategy to suppress error propagation; (2) block-wise noise scheduling and data shuffling to synergistically integrate diffusion and autoregressive modeling, enabling efficient arbitrary-length generation; and (3) LV-Bench—the first fine-grained long-video benchmark—along with a quantitative long-range consistency metric. Experiments demonstrate that BlockVid significantly outperforms prior methods on VBench and LV-Bench, achieving +22.2% and +19.4% improvements in VDE Subject and Clarity, respectively, yielding markedly clearer and temporally consistent minute-long videos.

📝 Abstract
Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.
Problem

Research questions and friction points this paper is trying to address.

Addresses error accumulation in long video generation
Overcomes lack of fine-grained long-video benchmarks
Improves temporal consistency in minute-long videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-aware sparse KV cache reduces error propagation
Block Forcing training strategy enhances temporal consistency
Chunk-wise noise scheduling and shuffling improves video coherence
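To make the semi-autoregressive idea concrete, the sketch below illustrates the general pattern of block-wise generation with a sparse, similarity-selected KV context. This is a toy illustration under stated assumptions, not the paper's implementation: the block summaries, the cosine-similarity selection rule, the cache size `k`, and the linear "denoising" update are all hypothetical stand-ins for the actual semantic-aware sparse KV cache and diffusion sampler.

```python
import numpy as np

def sparse_kv_select(cache, query, k=2):
    # Toy semantic-aware sparsity: keep only the k cached block
    # summaries most similar (cosine) to the current noisy block.
    if len(cache) <= k:
        return list(cache)
    sims = [float(query @ c / (np.linalg.norm(query) * np.linalg.norm(c) + 1e-8))
            for c in cache]
    top = np.argsort(sims)[-k:]
    return [cache[i] for i in sorted(top)]

def generate_blocks(num_blocks=4, dim=8, steps=5, rng=None):
    # Semi-autoregressive loop: each block starts from noise and is
    # "denoised" while conditioned on a sparse selection of earlier blocks,
    # rather than on the full (error-accumulating) KV history.
    rng = rng or np.random.default_rng(0)
    cache, blocks = [], []
    for _ in range(num_blocks):
        x = rng.standard_normal(dim)           # start from pure noise
        cond = sparse_kv_select(cache, x)      # sparse KV context
        ctx = np.mean(cond, axis=0) if cond else np.zeros(dim)
        for _ in range(steps):                 # toy denoising trajectory
            x = 0.8 * x + 0.2 * ctx            # drift toward the context
        blocks.append(x)
        cache.append(x)                        # cache summary for later blocks
    return np.stack(blocks)

video = generate_blocks()
print(video.shape)  # (4, 8)
```

The design point this sketch captures is that each new block conditions on a bounded, relevance-filtered subset of past blocks, which is how a sparse cache can limit long-horizon error propagation compared with attending to every stale cache entry.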
Zeyu Zhang
DAMO Academy, Alibaba Group
Shuning Chang
DAMO Academy, Alibaba Group
Yuanyu He
DAMO Academy, Alibaba Group; ZIP Lab, Zhejiang University
Yizeng Han
Alibaba DAMO Academy
Dynamic Neural Networks · Efficient Deep Learning · Computer Vision
Jiasheng Tang
DAMO Academy, Alibaba Group; Hupan Lab
Fan Wang
DAMO Academy, Alibaba Group
Bohan Zhuang
Zhejiang University
Efficient AI · MLSys