Block Cascading: Training Free Acceleration of Block-Causal Video Models

📅 2025-11-25
🤖 AI Summary
Block-causal video generation models face a severe speed-quality trade-off: e.g., 1.3B and 14B models achieve only 16 FPS and 4.5 FPS, respectively. This work proposes a training-free, partial-denoising context-triggered parallel inference method that overcomes the serial bottleneck imposed by block-causal architectures. The core innovation is to leverage partially denoised outputs from early blocks to proactively initiate generation of subsequent blocks, enabling temporal block-level parallelism. Combined with context reuse and KV-cache optimization, this eliminates context-switching overhead. On a 5-GPU setup, full-model inference accelerates by ~2x: the 1.3B model reaches 30 FPS, and the 14B model improves to 12.5 FPS, with no statistically significant degradation in generation quality. To the authors' knowledge, this is the first method to enable efficient parallelized inference for block-causal video models without fine-tuning or quality compromise.
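The scheduling idea behind the summary can be sketched in a few lines. This is a toy model (not the authors' implementation): it assumes each block runs `num_steps` denoising steps and that a successor block may start once its predecessor has finished a hypothetical `trigger_step` steps of partial denoising, which is how cascading turns a serial pipeline into overlapping work.

```python
# Toy sketch of block-cascading scheduling (illustrative, not the paper's code).
# In a sequential block-causal pipeline, block i starts only after block i-1
# has fully denoised. In the cascaded pipeline, block i starts as soon as
# block i-1 has completed `trigger_step` partial denoising steps.

def cascade_schedule(num_blocks: int, num_steps: int, trigger_step: int):
    """Return per-block start times, measured in denoising-step units."""
    sequential = [i * num_steps for i in range(num_blocks)]      # strict serial order
    cascaded = [i * trigger_step for i in range(num_blocks)]     # early hand-off
    return sequential, cascaded

def makespan(starts, num_steps):
    # Total step-units until the last block finishes denoising.
    return starts[-1] + num_steps

seq, cas = cascade_schedule(num_blocks=5, num_steps=50, trigger_step=10)
# Sequential: the 5th block starts at 200 and finishes at 250 step-units.
# Cascaded:   the 5th block starts at 40 and finishes at 90 step-units,
# since up to several blocks denoise concurrently (e.g., one per GPU).
```

With five blocks mapped to five GPUs, the overlap shortens end-to-end latency roughly in proportion to how early the hand-off happens; the real system must additionally manage KV caches and context reuse across blocks, which this sketch omits.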

📝 Abstract
Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates ~200 ms of KV-recaching overhead at each context switch during interactive generation. Extensive evaluations against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
Problem

Research questions and friction points this paper is trying to address.

Block-causal architectures impose a serial bottleneck: each block must be fully denoised before its successor can start
Severe speed-quality trade-off: 1.3B models reach only 16 FPS and 14B models only 4.5 FPS
KV-recaching (~200 ms per context switch) adds latency to interactive video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free parallelization for block-causal video models
Partially denoised context from predecessor blocks triggers simultaneous block generation
Context reuse eliminates KV-recaching overhead at context switches in interactive generation