🤖 AI Summary
To address the challenge of efficiently deploying video diffusion Transformers (e.g., OpenSora) on resource-constrained devices, this work proposes a static quantization method that requires no dynamic quantization support. The approach introduces a unified framework combining stepwise calibration, channel-wise weight quantization, tensor-wise activation quantization, and SmoothQuant, applied here to video diffusion models for the first time. It achieves FP16-level generation quality under static quantization, which had not previously been attained. Experiments show that the method matches FP16 and dynamic-quantization baselines (e.g., ViDiT-Q) on CLIP and VQA scores, preserving video generation fidelity without degradation, while significantly improving inference efficiency on AI accelerators. This work removes the long-standing quality bottleneck of static quantization for video generative models and establishes a viable pathway toward high-fidelity video synthesis at the edge.
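The quantization scheme summarized above (symmetric per-output-channel scales for weights, plus a single tensor-wise activation scale fixed offline from calibration data) can be sketched as follows. This is a minimal NumPy illustration with our own function names and shapes, not the paper's implementation; in the paper the activation statistics would be calibrated separately for each denoising timestep, while this sketch shows a single step.

```python
import numpy as np

def quantize_weight_per_channel(w, n_bits=8):
    """Symmetric INT8 quantization with one scale per output channel (row of w)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # shape (out_channels, 1)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def calibrate_activation_scale(calib_activations, n_bits=8):
    """Static tensor-wise scale: one scalar fixed offline from calibration batches,
    so no runtime min/max statistics (i.e., no dynamic quantization) are needed."""
    qmax = 2 ** (n_bits - 1) - 1
    return max(np.abs(a).max() for a in calib_activations) / qmax

def quantize_activation(x, scale, n_bits=8):
    """Quantize a runtime activation tensor with the precomputed static scale."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
```

A per-step variant would keep one activation scale per diffusion timestep (e.g., a dict indexed by timestep), all computed during the calibration pass.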
📝 Abstract
Diffusion Transformers for video generation have attracted significant research interest since the impressive performance of Sora. Efficient deployment of such generative-AI models on GPUs has been demonstrated with dynamic quantization. However, resource-constrained devices cannot support dynamic quantization and require static quantization of the models for efficient deployment on AI processors. In this paper, we propose a novel method for the post-training quantization of OpenSora [opensora], a video diffusion Transformer, without relying on dynamic quantization techniques. Our approach employs static quantization and achieves video quality comparable to FP16 and the dynamically quantized ViDiT-Q method, as measured by CLIP and VQA metrics. In particular, we use per-step calibration data to derive a statically quantized model for each timestep, combining channel-wise quantization for weights with tensor-wise quantization for activations. By further applying SmoothQuant, we obtain high-quality video outputs from the statically quantized models. Extensive experimental results demonstrate that static quantization is a viable alternative to dynamic quantization for video diffusion Transformers, offering a more efficient deployment path without sacrificing performance.
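To see why SmoothQuant fits the static setting, recall its core idea: migrate activation outliers into the weights via a per-input-channel scale s_j = max|X_j|^α / max|W_j|^(1-α), so that Y = (X · diag(s)^-1)(diag(s) · W) is numerically unchanged but both factors become easier to quantize. A minimal NumPy sketch, with our own naming and assuming the layer computes x @ w with w of shape (in_features, out_features):

```python
import numpy as np

def smoothquant_scales(w, act_max, alpha=0.5, eps=1e-5):
    """Per-input-channel smoothing scales s_j = max|X_j|^a / max|W_j|^(1-a)."""
    w_max = np.abs(w).max(axis=1)                  # per input channel, shape (in,)
    return (np.maximum(act_max, eps) ** alpha
            / np.maximum(w_max, eps) ** (1 - alpha))

def apply_smoothing(x, w, s):
    """Fold s into the weights and divide it out of the activations;
    the matrix product (x / s) @ (s * w) equals x @ w exactly."""
    return x / s, w * s[:, None]
```

Because s depends only on calibration statistics, the scaled weights can be folded offline once, which is exactly what a static (no runtime statistics) quantization pipeline needs.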