Accelerating Text-to-Video Generation with Calibrated Sparse Attention

📅 2026-03-05
🤖 AI Summary
Text-to-video diffusion models suffer from slow inference due to the high computational cost of spatiotemporal attention. This work proposes CalibAtt, a training-free method that uses an offline calibration pass to identify block-level sparsity and repetition patterns in attention that remain stable across inputs. By compiling layer-, head-, and diffusion-step-specific sparse attention operators, CalibAtt skips redundant connections during inference. It is the first approach to exploit query-stable block sparsity for hardware-friendly acceleration without model fine-tuning. Evaluated on large-scale models such as Wan 2.1 14B and Mochi 1, CalibAtt achieves up to 1.58× end-to-end speedup while preserving generation quality and text-video alignment.

📝 Abstract
Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
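The block-skipping idea described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: a precomputed boolean block mask (standing in for the calibrated per-layer, per-head, per-timestep pattern) decides which query-block/key-block tiles are computed, and skipped tiles receive -inf scores so they vanish under softmax. All function and variable names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(Q, K, V, block_mask, block=4):
    """Toy block-sparse attention (illustrative, not the paper's kernel).

    block_mask[i, j] == True means query block i attends to key block j;
    this stands in for the pattern found by offline calibration. Skipped
    tiles are never computed and get -inf scores, so the softmax
    renormalizes over the selected connections only.
    """
    n, d = Q.shape
    scores = np.full((n, n), -np.inf)
    for i in range(n // block):
        for j in range(n // block):
            if block_mask[i, j]:
                qs = slice(i * block, (i + 1) * block)
                ks = slice(j * block, (j + 1) * block)
                scores[qs, ks] = Q[qs] @ K[ks].T / np.sqrt(d)
    return softmax(scores, axis=-1) @ V
```

With an all-true mask this reduces exactly to dense attention; a calibrated mask that drops blocks whose scores are consistently negligible changes the output only marginally, which is the observation the method exploits. A real kernel would skip the masked tiles at the hardware level rather than materialize a full score matrix.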
Problem

Research questions and friction points this paper is trying to address.

text-to-video generation
diffusion models
spatiotemporal attention
computational efficiency
video generation speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse attention
video generation acceleration
training-free optimization
calibrated sparsity
diffusion models
Shai Yehezkel
Apple
Shahar Yadin
Apple
Noam Elata
Apple
Yaron Ostrovsky-Berman
Apple
Bahjat Kawar
CV/ML Researcher, Apple