🤖 AI Summary
To address the challenges of deploying video-generation diffusion Transformers (DiTs) on edge devices—namely, their excessive parameter count, high computational cost, and the poor generalizability of existing quantization methods—this paper proposes the first quantization-distillation co-design framework tailored for video generation. We introduce a Token-aware Quantization Estimator (TQE) to explicitly model and compensate for token-level quantization errors, and propose Temporal Maintenance Distillation (TMD), which leverages global spatiotemporal context to guide per-frame optimization and preserve inter-frame consistency. Employing W3A6 low-bit quantization (3-bit weights, 6-bit activations) with quantization-aware training, our method achieves state-of-the-art scene consistency (23.40) while maintaining generation quality, a 1.9× improvement over the current best quantization approach.
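The W3A6 setting mentioned above means weights are stored at 3-bit and activations at 6-bit precision. As a point of reference (not the paper's exact quantizer, whose details are not given here), a minimal symmetric uniform quantizer illustrates why so few bits lose information: 3 bits give only 8 representable levels per tensor.

```python
import numpy as np

def quantize(x, bits):
    # Symmetric uniform quantization: round to signed integers in
    # [-2^(bits-1), 2^(bits-1)-1], then dequantize back to floats.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_q = quantize(w, 3)   # 3-bit weights (the "W3" in W3A6)
a = rng.normal(size=(4,)).astype(np.float32)
a_q = quantize(a, 6)   # 6-bit activations (the "A6" in W3A6)
```

Quantization-aware training then learns weights under this rounding so the model compensates for the induced error, rather than quantizing a frozen model post hoc.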
📝 Abstract
Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9×. Code will be available at https://github.com/cantbebetter2/Q-VDiT.
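The abstract states that TQE compensates quantization error along the token dimension, but does not spell out its formulation. A hypothetical minimal sketch of the general idea is to estimate a per-token correction from the quantization residual on calibration data and add it back at inference (names and the calibration scheme below are illustrative assumptions, not the paper's method):

```python
import numpy as np

def quantize(x, bits):
    # Symmetric uniform quantization used for illustration.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def calibrate_token_bias(acts, bits=6):
    # acts: (num_tokens, feature_dim) calibration activations.
    # One scalar correction per token, estimated from the residual
    # between full-precision and quantized activations.
    residual = acts - quantize(acts, bits)
    return residual.mean(axis=1, keepdims=True)

rng = np.random.default_rng(1)
acts = rng.normal(size=(16, 64)).astype(np.float32)  # toy token activations
bias = calibrate_token_bias(acts)
compensated = quantize(acts, 6) + bias  # per-token error compensation
```

By construction, the per-token mean residual of the compensated activations is driven to zero on the calibration data; the paper's TQE additionally handles the feature dimension, which this toy sketch does not.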