DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

📅 2026-05-15

🤖 AI Summary
This work addresses the severe degradation in image generation quality that typically follows 4-bit post-training quantization of Diffusion Transformers. The authors propose DiRotQ, a W4A4 framework that combines PCA-based orthogonal rotation with quantization: activations are rotated into a PCA basis, the coefficients in the dominant-variance subspace are kept at higher precision, and the remaining components are quantized to 4 bits. Weights are quantized with GPTQ, and a custom Triton kernel enables efficient end-to-end inference. To assess perceptual quality and prompt alignment beyond standard metrics, the study introduces a VLM-as-a-Judge protocol. On PixArt-Σ over MJHQ-30K, DiRotQ achieves an FID of 15.9 and a PSNR of 19.1 dB, outperforming SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting; on the 12B FLUX.1-dev model, it reduces memory usage by 2.1× and delivers a 2.3× speedup over the BF16 baseline on a 24 GB RTX 4090.
📝 Abstract
Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
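The rotation-aware activation quantization described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`pca_basis`, `fake_quant_int4`, `rotate_and_quantize`), the synthetic data, the per-row symmetric quantizer, and the choice to keep the dominant coefficients in full floating point are all assumptions made for illustration. GPTQ weight quantization and the Triton kernel are omitted. The sketch only shows two ideas: an orthogonal rotation can be fused into the weights without changing the layer output, and keeping the high-variance coefficients at higher precision reduces the 4-bit activation error.

```python
import numpy as np

def pca_basis(calib):
    """Orthogonal basis from calibration activations (hypothetical helper).
    Columns are sorted by descending variance."""
    cov = calib.T @ calib / calib.shape[0]   # uncentered second moment (assumption)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]

def fake_quant_int4(x):
    """Symmetric per-row 4-bit fake quantization: quantize, then dequantize."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(x / scale), -8, 7) * scale

def rotate_and_quantize(x, Q, rank):
    """Rotate into the PCA basis; keep the first `rank` coefficients in
    full precision (assumption) and fake-quantize the rest to 4 bits."""
    z = x @ Q
    z_q = z.copy()
    z_q[:, rank:] = fake_quant_int4(z[:, rank:])
    return z_q

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 32, 8

# Synthetic activations whose variance is dominated by `rank` directions.
row_scale = np.concatenate([np.full(rank, 20.0), np.ones(d_in - rank)])
mix = rng.normal(size=(d_in, d_in)) * row_scale[:, None]
calib = rng.normal(size=(2048, d_in)) @ mix   # calibration set
x = rng.normal(size=(256, d_in)) @ mix        # held-out activations

W = rng.normal(size=(d_out, d_in))            # layer computes y = x @ W.T
Q = pca_basis(calib)
W_rot = W @ Q                                 # rotation fused into weights offline

y_ref = x @ W.T
# Without quantization the rotation is exact: (x Q)(W Q)^T = x W^T.
y_rot_exact = (x @ Q) @ W_rot.T

# Compare 4-bit activation error with and without the rotation.
err_direct = np.linalg.norm(fake_quant_int4(x) @ W.T - y_ref)
err_rot = np.linalg.norm(rotate_and_quantize(x, Q, rank) @ W_rot.T - y_ref)
print(f"direct 4-bit error: {err_direct:.2f}, rotated mixed-precision error: {err_rot:.2f}")
```

On this synthetic data the rotated, mixed-precision path should give a smaller output error than quantizing the raw activations directly, because the per-row quantization scale is no longer set by the few dominant directions. The size of the gap depends on how concentrated the activation variance is and says nothing about the FID or PSNR figures reported in the abstract.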
Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers
Post-Training Quantization
4-bit quantization
quality degradation
image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

rotation-aware quantization
diffusion transformers
post-training quantization
PCA-based activation compression
VLM-as-a-Judge