🤖 AI Summary
Diffusion models—particularly Diffusion Transformers—face significant challenges for on-device deployment due to their high memory and computational overhead. Post-training quantization (PTQ) further struggles with activation outliers, limiting its effectiveness for aggressive low-bit quantization (e.g., 4-bit). To address this, we propose HadaNorm: a fine-tuning-free linear preprocessing technique that applies channel-wise mean centering before a Hadamard transformation. This method effectively suppresses activation outliers across Transformer blocks, thereby enhancing PTQ robustness. Specifically designed for Diffusion Transformer architectures, HadaNorm consistently reduces quantization error across multiple components—including attention and feed-forward layers—without architectural modification or retraining. Experiments demonstrate superior accuracy-efficiency trade-offs over state-of-the-art methods, enabling, for the first time, practical 4-bit quantization of Diffusion Transformers suitable for efficient on-device inference.
📝 Abstract
Diffusion models represent the cutting edge in image generation, but their high memory and computational demands hinder deployment on resource-constrained devices. Post-Training Quantization (PTQ) offers a promising solution by reducing the bitwidth of matrix operations. However, standard PTQ methods struggle with outliers, and achieving higher compression often requires transforming model weights and activations before quantization. In this work, we propose HadaNorm, a novel linear transformation that extends existing approaches and effectively mitigates outliers by normalizing activation feature channels before applying Hadamard transformations, enabling more aggressive activation quantization. We demonstrate that HadaNorm consistently reduces quantization error across the various components of transformer blocks, achieving superior efficiency-performance trade-offs when compared to state-of-the-art methods.
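The pipeline the abstract describes — center each activation channel, rotate tokens with an orthonormal Hadamard transform to spread outlier energy across channels, then quantize at low bitwidth — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names (`fwht`, `hadanorm`, `quantize_sym`), the per-tensor symmetric quantizer, and the assumption that the channel count is a power of two are all ours.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    vec = list(vec)
    n, h = len(vec), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = vec[j], vec[j + h]
                vec[j], vec[j + h] = a + b, a - b  # butterfly step
        h *= 2
    s = 1.0 / math.sqrt(n)          # scale so H is orthonormal (H @ H = I)
    return [v * s for v in vec]

def hadanorm(activations):
    """HadaNorm-style preprocessing (sketch): mean-center each channel
    across tokens, then apply the Hadamard transform to each token."""
    n_tok, n_ch = len(activations), len(activations[0])
    means = [sum(row[c] for row in activations) / n_tok for c in range(n_ch)]
    centered = [[row[c] - means[c] for c in range(n_ch)] for row in activations]
    return [fwht(row) for row in centered]

def quantize_sym(x, bits=4):
    """Naive symmetric per-tensor quantizer (illustrative stand-in for PTQ)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for row in x for v in row)
    scale = (amax / qmax) if amax > 0 else 1.0
    return [[round(v / scale) * scale for v in row] for row in x]
```

Because mean centering removes a channel's systematic offset and the Hadamard rotation mixes every channel into every output coordinate, a single extreme channel no longer dominates the dynamic range, so the 4-bit quantization grid is used far more effectively. Since the transform is linear and orthonormal, it can be folded into adjacent weight matrices at no inference cost, which is what makes this kind of preprocessing fine-tuning-free.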