🤖 AI Summary
Diffusion Transformers achieve high-fidelity image generation but suffer from excessive memory consumption and high inference latency due to their massive parameter counts, hindering practical deployment. To address this, we propose the first rotation-based 4-bit quantization method (W4A4) designed specifically for diffusion Transformers, requiring no retraining and enabling plug-and-play integration. Our approach applies group-wise Hadamard transforms to suppress outliers along both the channel and sequence dimensions, and introduces the ConvLinear4bit module to jointly realize rotation, grouped quantization, low-bit GEMM, and dequantization. Evaluated on FLUX.1-dev, our method achieves a 2.26× inference speedup and 4.05× memory compression with no loss of generation quality. The key contribution is the first integration of rotation-based quantization into diffusion Transformer architectures, which circumvents conventional quantization's reliance on restrictive weight-distribution assumptions and establishes a new paradigm for efficient large-model deployment.
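The quantize → low-bit GEMM → dequantize stages described above can be sketched numerically as follows. This is a minimal illustration only: `w4a4_linear`, the symmetric per-tensor scaling, and the error threshold are our assumptions, not the actual ConvLinear4bit kernel (which fuses rotation as well and runs true 4-bit GEMM on hardware).

```python
import numpy as np

def quant_int4(x):
    """Symmetric per-tensor 4-bit quantization: integer codes in [-8, 7] plus a scale.
    (Illustrative; the real module uses grouped quantization.)"""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int32)
    return q, scale

def w4a4_linear(x, w):
    """Hypothetical sketch of a W4A4 linear layer: quantize activations and
    weights to 4 bits, multiply in integer arithmetic (accumulate in int32),
    then dequantize the output with the product of the two scales."""
    qx, sx = quant_int4(x)
    qw, sw = quant_int4(w)
    acc = qx @ qw.T                      # low-bit GEMM, int32 accumulation
    return acc.astype(np.float64) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))    # activations
w = rng.normal(size=(32, 64))   # weights
y = w4a4_linear(x, w)
ref = x @ w.T
rel = np.linalg.norm(y - ref) / np.linalg.norm(ref)
print(y.shape, rel)  # modest relative error on well-behaved inputs
```

On outlier-free Gaussian inputs the relative error stays small; the rotation step exists precisely because real transformer activations contain outliers that would otherwise inflate the quantization scale.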
📝 Abstract
Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining and preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26× speedup and 4.05× memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
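A minimal numerical sketch of the group-wise rotation idea (illustrative only; the function names, group size, and per-tensor fake quantization are our assumptions, not the paper's implementation): an orthonormal Hadamard rotation applied within each channel group spreads a single outlier channel's energy across the group, shrinking the quantization scale and hence the 4-bit rounding error. Because each group has fixed size, the rotation cost grows linearly with the channel count rather than quadratically as for a full-size rotation.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def groupwise_rotate(x, group=16):
    """Rotate each contiguous group of `group` channels with an orthonormal
    Hadamard matrix (hypothetical sketch of the group-wise transform;
    O(n * group) work for n channels, i.e. linear for fixed group size)."""
    H = hadamard(group) / np.sqrt(group)  # orthonormal, hence exactly invertible
    n = x.shape[-1]
    assert n % group == 0
    xg = x.reshape(*x.shape[:-1], n // group, group)
    return (xg @ H).reshape(x.shape)

def fake_quant_int4(x):
    """Symmetric 4-bit fake quantization with a single per-tensor scale."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 64))
x[:, 3] *= 50.0  # inject one column-wise (channel) outlier

err_plain = np.linalg.norm(fake_quant_int4(x) - x)
xr = groupwise_rotate(x)
# The rotation is orthonormal, so quantization error measured in the rotated
# basis equals the error after rotating back.
err_rot = np.linalg.norm(fake_quant_int4(xr) - xr)
print(err_rot < err_plain)
```

The outlier channel dominates the per-tensor scale before rotation; after the group-wise Hadamard transform its magnitude is divided roughly by the square root of the group size, so the same 4-bit budget yields a visibly smaller reconstruction error.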