🤖 AI Summary
Block-causal video generation models face a severe speed–quality trade-off: for example, 1.3B and 14B models achieve only 16 FPS and 4.5 FPS, respectively. This work proposes a training-free parallel inference method, triggered by partially denoised context, that overcomes the serial bottleneck imposed by block-causal architectures. The core innovation is to leverage partially denoised outputs from early blocks to proactively initiate generation of subsequent blocks, enabling temporal block-level parallelism. Combined with context reuse and KV-cache optimization, this also eliminates context-switching overhead. On a 5-GPU setup, full-model inference accelerates by roughly 2×: the 1.3B model reaches 30 FPS, and the 14B model improves to 12.5 FPS, with no statistically significant degradation in generation quality. To our knowledge, this is the first method to enable efficient parallelized inference for block-causal video models without fine-tuning or quality compromise.
📝 Abstract
Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, and 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates the ~200 ms KV-recaching overhead incurred during context switches in interactive generation. Extensive evaluations across multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
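The parallelism argument above can be illustrated with a toy scheduling model. This is not the paper's implementation: the `lag` parameter (how many denoising steps ahead the predecessor block must be before its partially denoised output is usable as context) and the one-GPU-per-block, one-tick-per-step assumptions are illustrative simplifications.

```python
def cascaded_finish_time(num_blocks: int, num_steps: int, lag: int = 1) -> int:
    """Wall-clock ticks to finish the last block under a cascaded schedule.

    Assumptions (illustrative, not from the paper): each block runs
    `num_steps` denoising steps, each step costs 1 tick, and each block
    has its own GPU so steps of different blocks overlap. Block b may run
    step t only after block b-1 has completed step min(t + lag, num_steps - 1),
    i.e. its context is "partially denoised" enough.
    """
    finish = [[0] * num_steps for _ in range(num_blocks)]
    for b in range(num_blocks):
        for t in range(num_steps):
            prev_step = finish[b][t - 1] if t > 0 else 0
            ctx_ready = (
                finish[b - 1][min(t + lag, num_steps - 1)] if b > 0 else 0
            )
            finish[b][t] = max(prev_step, ctx_ready) + 1
    return finish[-1][-1]


# Fully sequential block-causal inference costs num_blocks * num_steps ticks;
# the cascade instead adds only a small per-block offset after the first block.
sequential = 5 * 4                          # 5 blocks, 4 denoising steps each
cascaded = cascaded_finish_time(5, 4)       # overlapped across 5 "GPUs"
print(sequential, cascaded)                 # cascade finishes far earlier
```

With these toy numbers the cascade finishes in 12 ticks versus 20 sequentially, roughly the ~2x regime the abstract reports for 5 GPUs; the exact ratio depends on the lag and step counts, which here are made-up values.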