Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

πŸ“… 2025-02-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Diffusion Transformers (DiTs) incur prohibitive attention overhead when generating long videos: an 8-second 720p video costs roughly 600 PFLOPs, of which about 500 PFLOPs come from attention operations. Method: We propose AdaSpa, a training-free, plug-and-play adaptive sparse attention method. Its core innovation is the first discovery that sparse attention patterns and the log-sum-exp (LSE) term remain invariant across denoising steps in DiTs. Leveraging this, AdaSpa introduces a hierarchical block-sparse framework that combines dynamic pattern modeling with online precise index search, augmented by head-adaptive masking and LSE caching to accelerate index generation. Contribution/Results: Applied to models including HunyuanVideo, AdaSpa removes roughly 500 PFLOPs of attention computation (about 83% of the total compute), significantly accelerating inference. It requires zero fine-tuning, zero training data, and strictly preserves generation quality.

πŸ“ Abstract
Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K tokens) with HunyuanVideo requires about 600 PFLOPs, with around 500 PFLOPs consumed by attention computations. To address this issue, we propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. Firstly, to realize the Dynamic Pattern, we introduce a blockified pattern to efficiently capture the hierarchical sparsity inherent in DiTs. This is based on our observation that the sparse characteristics of DiTs exhibit hierarchical and blockified structures between and within different modalities. This blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated videos. Secondly, to enable Online Precise Search, we propose the Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention. This method is motivated by our finding that DiTs' sparse pattern and LSE vary w.r.t. inputs, layers, and heads, but remain invariant across denoising steps. By leveraging this invariance across denoising steps, it adapts to the dynamic nature of DiTs and allows for precise, real-time identification of sparse indices with minimal overhead. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs, requiring neither additional fine-tuning nor dataset-dependent profiling. Extensive experiments validate that AdaSpa delivers substantial acceleration across various models while preserving video quality, establishing itself as a robust and scalable approach to efficient video generation.
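The blockified sparse pattern described in the abstract can be illustrated with a minimal sketch: pool queries and keys into blocks, score key blocks against each query block, and keep only the top-scoring blocks. The mean-pooling and top-k selection below are simplifying assumptions for illustration, not AdaSpa's actual index-search procedure.

```python
import numpy as np

def blockwise_sparse_mask(q, k, block=4, keep_ratio=0.5):
    """Build a block-level sparsity mask: mean-pool Q and K into blocks,
    score key blocks per query block, and keep the highest-scoring ones.
    Illustrative sketch only; not AdaSpa's exact search."""
    n, d = q.shape
    nb = n // block
    qb = q[: nb * block].reshape(nb, block, d).mean(axis=1)  # (nb, d) pooled queries
    kb = k[: nb * block].reshape(nb, block, d).mean(axis=1)  # (nb, d) pooled keys
    scores = qb @ kb.T / np.sqrt(d)                          # block-level attention scores
    keep = max(1, int(np.ceil(keep_ratio * nb)))
    # For each query block, mark the top-`keep` key blocks as active.
    top = np.argsort(scores, axis=1)[:, -keep:]
    mask = np.zeros((nb, nb), dtype=bool)
    rows = np.repeat(np.arange(nb), keep)
    mask[rows, top.ravel()] = True
    return mask
```

Only the blocks marked active would then be computed by a block-sparse attention kernel, which is where the FLOPs savings come from.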
Problem

Research questions and friction points this paper is trying to address.

Reduces computational demands in long video generation.
Introduces adaptive sparse attention for efficient processing.
Maintains video quality while significantly accelerating generation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Pattern blockified approach reduces attention complexity.
Online Precise Search adapts to DiTs' dynamic nature.
AdaSpa integrates seamlessly without additional fine-tuning.
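The step-invariance underlying these contributions (the sparse pattern and LSE term vary by layer and head but stay stable across denoising steps) can be sketched as a simple cache: compute the log-sum-exp once per (layer, head), then reuse it at later steps. The class name and keying scheme below are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

class LSECache:
    """Sketch of reusing the attention log-sum-exp (LSE) across denoising
    steps. Cache keys are (layer, head); names are hypothetical."""

    def __init__(self):
        self._cache = {}  # (layer, head) -> cached per-query LSE vector

    def lse(self, layer, head, logits):
        """Return the per-query LSE of `logits` ((n_q, n_k) pre-softmax
        scores), computing it only the first time a (layer, head) pair is
        seen and returning the cached value on later denoising steps."""
        key = (layer, head)
        if key not in self._cache:
            m = logits.max(axis=1, keepdims=True)  # subtract max for stability
            self._cache[key] = (
                m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
            ).ravel()
        return self._cache[key]
```

Because the cached LSE is reused rather than recomputed, later denoising steps skip part of the index-search cost, which is the effect the summary attributes to LSE caching.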
πŸ”Ž Similar Papers
No similar papers found.