🤖 AI Summary
Block-causal video generation models face a severe speed–quality trade-off: for example, 1.3B and 14B models achieve only 16 FPS and 4.5 FPS, respectively. This work proposes a training-free parallel inference method, triggered by partially denoised context, that overcomes the serial bottleneck imposed by block-causal architectures. The core innovation is to leverage partially denoised outputs from early blocks to proactively initiate generation of subsequent blocks, enabling temporal block-level parallelism. Combined with context reuse and KV-cache optimization, this also eliminates context-switching overhead. On a 5-GPU setup, full-model inference accelerates by roughly 2×: the 1.3B model reaches 30 FPS, and the 14B model improves to 12.5 FPS, with no statistically significant degradation in generation quality. To our knowledge, this is the first method to enable efficient parallelized inference for block-causal video models without fine-tuning or quality compromise.
📝 Abstract
Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, and 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates the ~200 ms KV-recaching overhead incurred during context switches in interactive generation. Extensive evaluations across multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
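The parallelism argument above can be illustrated with a toy scheduling model. This is not the paper's implementation: the `lag` parameter (how many denoising steps ahead the predecessor block must be before its partially denoised output is usable as context) and the one-GPU-per-block, one-tick-per-step assumptions are illustrative simplifications.

```python
def cascaded_finish_time(num_blocks: int, num_steps: int, lag: int = 1) -> int:
    """Wall-clock ticks to finish the last block under a cascaded schedule.

    Assumptions (illustrative, not from the paper): each block runs
    `num_steps` denoising steps, each step costs 1 tick, and each block
    has its own GPU so steps of different blocks overlap. Block b may run
    step t only after block b-1 has completed step min(t + lag, num_steps - 1),
    i.e. its context is "partially denoised" enough.
    """
    finish = [[0] * num_steps for _ in range(num_blocks)]
    for b in range(num_blocks):
        for t in range(num_steps):
            prev_step = finish[b][t - 1] if t > 0 else 0
            ctx_ready = (
                finish[b - 1][min(t + lag, num_steps - 1)] if b > 0 else 0
            )
            finish[b][t] = max(prev_step, ctx_ready) + 1
    return finish[-1][-1]


# Fully sequential block-causal inference costs num_blocks * num_steps ticks;
# the cascade instead adds only a small per-block offset after the first block.
sequential = 5 * 4                          # 5 blocks, 4 denoising steps each
cascaded = cascaded_finish_time(5, 4)       # overlapped across 5 "GPUs"
print(sequential, cascaded)                 # cascade finishes far earlier
```

With these toy numbers the cascade finishes in 12 ticks versus 20 sequentially, roughly the ~2x regime the abstract reports for 5 GPUs; the exact ratio depends on the lag and step counts, which here are made-up values.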