TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow inference and deployment challenges of video diffusion models, this paper proposes TurboDiffusion, a holistic acceleration framework. It jointly designs low-bit SageAttention with trainable Sparse-Linear Attention (SLA), integrates rCM step distillation, and applies end-to-end W8A8 quantization. Through algorithm-system co-optimization, TurboDiffusion achieves 100-200x end-to-end speedup on a single RTX 5090 GPU, enabling efficient inference for video generation models at multiple scales, up to 14B parameters. Generated video quality remains on par with the full-precision baseline, outperforming existing acceleration methods. Key contributions include: (i) a hybrid attention mechanism combining low-bit precision and sparsity; (ii) a lightweight, video-diffusion-specific distillation paradigm; and (iii) a fully W8A8-quantized, hardware-aware deployment pipeline.

📝 Abstract
We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
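The abstract's third component, W8A8 quantization, stores both weights and activations in 8-bit integers so linear layers can run on fast integer hardware. The following is a minimal NumPy sketch of symmetric per-tensor int8 quantization applied to a linear layer; it illustrates the general W8A8 idea only, not TurboDiffusion's actual quantization scheme, and all names are illustrative.

```python
import numpy as np

def quantize_sym_int8(x):
    """Symmetric per-tensor int8 quantization: x ~= q * scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_linear(x, w):
    """Linear layer with int8 weights AND int8 activations.

    The matmul accumulates in int32 (as integer tensor cores do),
    then the result is rescaled back to float in one step.
    """
    qx, sx = quantize_sym_int8(x)
    qw, sw = quantize_sym_int8(w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T
    return acc.astype(np.float32) * (sx * sw)

# Compare against the full-precision reference.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((32, 64)).astype(np.float32)
y_fp = x @ w.T
y_q = w8a8_linear(x, w)
rel_err = np.abs(y_q - y_fp).mean() / np.abs(y_fp).mean()
```

With well-behaved activations the mean relative error of this toy setup stays at the percent level, which is why 8-bit linear layers can preserve quality while cutting both compute and memory roughly in half versus FP16.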
Problem

Research questions and friction points this paper is trying to address.

Video diffusion models are slow to sample and costly to deploy, limiting practical use.
Attention computation is a major inference bottleneck for long video token sequences.
Full-precision weights and activations make linear layers slow and models large.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-bit SageAttention and Sparse-Linear Attention for faster attention computation
Step distillation with rCM to reduce required generation steps
W8A8 quantization to compress model and accelerate linear layers
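The first innovation, low-bit attention, quantizes Q and K to int8 so the QK^T matmul can run on integer tensor cores, while softmax and the PV product stay in floating point for stability. Below is a minimal NumPy sketch of this general idea; it omits SageAttention's refinements (e.g. smoothing) and the SLA sparsity component, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def int8_attention(q, k, v):
    """Attention with Q and K quantized to int8.

    Only the QK^T matmul runs in integer arithmetic; softmax and
    the value aggregation remain in float32.
    """
    sq = np.abs(q).max() / 127.0
    sk = np.abs(k).max() / 127.0
    qi = np.clip(np.round(q / sq), -127, 127).astype(np.int8)
    ki = np.clip(np.round(k / sk), -127, 127).astype(np.int8)
    scores = qi.astype(np.int32) @ ki.astype(np.int32).T
    scores = scores.astype(np.float32) * (sq * sk) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Sanity check against full-precision attention.
rng = np.random.default_rng(0)
q = rng.standard_normal((16, 64)).astype(np.float32)
k = rng.standard_normal((16, 64)).astype(np.float32)
v = rng.standard_normal((16, 64)).astype(np.float32)
out_q = int8_attention(q, k, v)
out_fp = softmax((q @ k.T) / np.sqrt(64)) @ v
```

Because softmax is dominated by the relative ordering of scores, small int8 rounding errors in QK^T barely move the attention weights, which is what makes low-bit attention viable.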