FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

📅 2025-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video diffusion models suffer from slow inference and high computational overhead, hindering practical deployment. While existing training-free acceleration techniques, such as quantization and sparsification, yield gains individually, naively combining them degrades performance significantly because they are not jointly optimized. This paper proposes FPSAttention, a training-aware co-optimization framework that integrates FP8 quantization with structured sparsity, specifically targeting the 3D bidirectional attention that dominates video diffusion inference. It introduces three key innovations: (1) a unified 3D tile-wise granularity shared by quantization and sparsity; (2) a denoising-step-aware strategy for error control; and (3) a FlashAttention-based fused kernel natively optimized for the NVIDIA Hopper architecture. Evaluated on 720p video generation, the method achieves a 7.09× speedup in attention kernels and a 4.96× end-to-end inference speedup without degrading generation quality, substantially easing the efficiency bottleneck of video diffusion inference.
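The denoising-step-aware idea in the summary can be sketched as a simple compute-budget schedule. The following is a hypothetical illustration only: it assumes, as one plausible reading, that early high-noise steps tolerate more aggressive sparsity than late steps; the function name, parameters, and linear shape are illustrative assumptions, not the paper's actual schedule.

```python
def step_aware_budget(step, total_steps, min_keep=0.3, max_keep=0.9):
    """Hypothetical schedule: fraction of attention tiles to keep at a given
    denoising step. Assumes (illustratively) that early, high-noise steps
    tolerate coarser computation, while late steps need full fidelity."""
    frac = step / max(1, total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return min_keep + (max_keep - min_keep) * frac
```

A real step-aware strategy would be calibrated against the noise schedule rather than interpolated linearly; the point of the sketch is only that the sparsity/quantization error budget becomes a function of the denoising step.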

📝 Abstract
Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.
Problem

Research questions and friction points this paper is trying to address.

Slow inference speeds in video diffusion models
High computational demands hindering deployment
Performance degradation from naive quantization-sparsity combination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified 3D tile-wise granularity for quantization and sparsity
Denoising step-aware strategy adapting to noise schedule
Hardware-friendly kernel leveraging FlashAttention and Hopper architecture
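The first innovation, a granularity shared by both techniques, can be illustrated with a toy NumPy sketch: the latent token volume is split into 3D tiles, and the same tile then serves as the unit for both a quantization scale and a sparsity keep/drop decision, so neither operation fragments the other's layout. This is a hypothetical emulation, not the paper's method: int8 stands in for FP8, and the tile shape, keep ratio, and L2-norm importance metric are illustrative assumptions.

```python
import numpy as np


def partition_tiles(x, tile):
    """Split a (T, H, W) latent volume into 3D tiles of shape (t, h, w).
    Assumes each dimension divides evenly; returns (nt, nh, nw, t, h, w)."""
    T, H, W = x.shape
    t, h, w = tile
    return x.reshape(T // t, t, H // h, h, W // w, w).transpose(0, 2, 4, 1, 3, 5)


def tilewise_quant_and_sparsify(x, tile=(2, 4, 4), keep_ratio=0.5):
    """Toy co-design: one scale per 3D tile (quantization) and one keep/drop
    decision per 3D tile (sparsity), at the same unified granularity.
    Uses int8 as a stand-in for FP8, which NumPy does not support natively."""
    tiles = partition_tiles(x, tile)
    flat = tiles.reshape(-1, int(np.prod(tile)))          # one row per 3D tile
    # Per-tile quantization scale (absmax), shared granularity with sparsity.
    scales = np.abs(flat).max(axis=1, keepdims=True) + 1e-12
    q = np.clip(np.round(flat / scales * 127), -127, 127).astype(np.int8)
    # Tile-level importance: keep the top tiles by L2 norm, drop the rest.
    importance = np.linalg.norm(flat, axis=1)
    k = max(1, int(keep_ratio * len(importance)))
    mask = np.zeros(len(importance), dtype=bool)
    mask[np.argsort(importance)[::-1][:k]] = True
    return q, scales, mask
```

Because the quantization scale and the sparsity mask live on the same tile grid, a fused kernel can skip a dropped tile entirely without ever dequantizing it, which is the structural property the co-design exploits.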
👥 Authors
Akide Liu
PhD Student @ Monash University
Efficient AI · Computer Vision
Zeyu Zhang
DAMO Academy, Alibaba Group
Xuehai Bai
ZIP Lab, Zhejiang University
Yizeng Han
Alibaba DAMO Academy
Dynamic Neural Networks · Efficient Deep Learning · Computer Vision
Jiasheng Tang
DAMO Academy, Alibaba Group, Hupan Lab
Yuanjie Xing
DAMO Academy, Alibaba Group
Jichao Wu
DAMO Academy, Alibaba Group
Mingyang Yang
DAMO Academy, Alibaba Group
Weihua Chen
Alibaba DAMO Academy, previously NLPR, CASIA
Computer Vision
Jiahao He
Yuanyu He
Fan Wang
DAMO Academy, Alibaba Group
Gholamreza Haffari
Monash University
Bohan Zhuang
Zhejiang University
Efficient AI · MLSys