🤖 AI Summary
To address the challenge of efficiently deploying video diffusion Transformers (e.g., OpenSora) on resource-constrained devices, this work proposes a static quantization method that requires no dynamic quantization support. The approach introduces a unified framework combining stepwise calibration, channel-wise weight quantization, tensor-wise activation quantization, and SmoothQuant, applied here to video diffusion models for the first time. It achieves FP16-level generation quality under static quantization, which had not previously been attained. Experiments show that the method matches FP16 and dynamic-quantization baselines (e.g., ViDiT-Q) on CLIP and VQA scores, preserving video generation fidelity without degradation, while significantly improving inference efficiency on AI accelerators. This work removes the long-standing quality bottleneck of static quantization for video generative models and establishes a viable pathway toward high-fidelity video synthesis at the edge.
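The quantization scheme summarized above (symmetric per-output-channel scales for weights, plus a single tensor-wise activation scale fixed offline from calibration data) can be sketched as follows. This is a minimal NumPy illustration with our own function names and shapes, not the paper's implementation; in the paper the activation statistics would be calibrated separately for each denoising timestep, while this sketch shows a single step.

```python
import numpy as np

def quantize_weight_per_channel(w, n_bits=8):
    """Symmetric INT8 quantization with one scale per output channel (row of w)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # shape (out_channels, 1)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def calibrate_activation_scale(calib_activations, n_bits=8):
    """Static tensor-wise scale: one scalar fixed offline from calibration batches,
    so no runtime min/max statistics (i.e., no dynamic quantization) are needed."""
    qmax = 2 ** (n_bits - 1) - 1
    return max(np.abs(a).max() for a in calib_activations) / qmax

def quantize_activation(x, scale, n_bits=8):
    """Quantize a runtime activation tensor with the precomputed static scale."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
```

A per-step variant would keep one activation scale per diffusion timestep (e.g., a dict indexed by timestep), all computed during the calibration pass.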
📝 Abstract
Diffusion Transformers for video generation have attracted significant research interest since the impressive performance of Sora. Efficient deployment of such generative-AI models on GPUs has been demonstrated with dynamic quantization. However, resource-constrained devices cannot support dynamic quantization and require static quantization of the models for efficient deployment on AI processors. In this paper, we propose a novel method for the post-training quantization of OpenSora [opensora], a video diffusion Transformer, without relying on dynamic quantization techniques. Our approach employs static quantization and achieves video quality comparable to FP16 and the dynamically quantized ViDiT-Q method, as measured by CLIP and VQA metrics. In particular, we use per-step calibration data to derive a statically quantized model for each timestep, combining channel-wise quantization for weights with tensor-wise quantization for activations. By further applying SmoothQuant, we obtain high-quality video outputs from the statically quantized models. Extensive experimental results demonstrate that static quantization is a viable alternative to dynamic quantization for video diffusion Transformers, offering a more efficient deployment path without sacrificing performance.
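To see why SmoothQuant fits the static setting, recall its core idea: migrate activation outliers into the weights via a per-input-channel scale s_j = max|X_j|^α / max|W_j|^(1-α), so that Y = (X · diag(s)^-1)(diag(s) · W) is numerically unchanged but both factors become easier to quantize. A minimal NumPy sketch, with our own naming and assuming the layer computes x @ w with w of shape (in_features, out_features):

```python
import numpy as np

def smoothquant_scales(w, act_max, alpha=0.5, eps=1e-5):
    """Per-input-channel smoothing scales s_j = max|X_j|^a / max|W_j|^(1-a)."""
    w_max = np.abs(w).max(axis=1)                  # per input channel, shape (in,)
    return (np.maximum(act_max, eps) ** alpha
            / np.maximum(w_max, eps) ** (1 - alpha))

def apply_smoothing(x, w, s):
    """Fold s into the weights and divide it out of the activations;
    the matrix product (x / s) @ (s * w) equals x @ w exactly."""
    return x / s, w * s[:, None]
```

Because s depends only on calibration statistics, the scaled weights can be folded offline once, which is exactly what a static (no runtime statistics) quantization pipeline needs.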