ConvRot: Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers achieve high-fidelity image generation but suffer from excessive memory consumption and high inference latency due to their massive parameter count, hindering practical deployment. To address this, we propose the first rotation-based 4-bit quantization method (W4A4) specifically designed for diffusion Transformers, requiring no retraining and enabling plug-and-play integration. Our approach applies group-wise Hadamard transforms to suppress outliers across both channel and sequence dimensions, and introduces the ConvLinear4bit module to jointly realize rotation, grouped quantization, low-bit GEMM computation, and dequantization. Evaluated on FLUX.1-dev, our method achieves a 2.26× inference speedup and 4.05× memory compression with no loss in generation quality. The key contribution lies in pioneering the integration of rotation quantization into diffusion Transformer architectures, thereby circumventing conventional quantization's reliance on restrictive weight-distribution assumptions, and establishing a new paradigm for efficient large-model deployment.

📝 Abstract
Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages the regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining and preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26× speedup and 4.05× memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
Problem

Research questions and friction points this paper is trying to address.

Reduces memory and latency in large diffusion transformers
Enables 4-bit quantization without retraining for diffusion transformers
Suppresses outliers efficiently to maintain image quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group-wise rotation using the regular Hadamard transform (RHT)
Plug-and-play ConvLinear4bit module for W4A4 inference without retraining
Rotation complexity reduced from quadratic to linear
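The core idea above can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's ConvLinear4bit kernel: a block-diagonal (group-wise) Sylvester-Hadamard rotation spreads an outlier's energy across its group before symmetric 4-bit rounding, and because the group size is fixed, the rotation cost grows linearly with the feature dimension rather than quadratically. Function names and the group size are illustrative choices.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    # The result is symmetric and, after scaling, orthonormal.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def groupwise_rotate(x, group=16):
    # Block-diagonal rotation along the last axis: each group of
    # `group` features is rotated independently, so total cost is
    # O(d * group) instead of O(d^2) for a full d x d rotation.
    d = x.shape[-1]
    assert d % group == 0
    H = hadamard(group)
    xg = x.reshape(*x.shape[:-1], d // group, group)
    return (xg @ H.T).reshape(x.shape)

def quantize_int4(x):
    # Symmetric 4-bit quantization: integers in [-8, 7] plus a scale.
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

np.random.seed(0)
x = np.random.randn(2, 64)
x[0, 3] = 20.0                      # inject a large outlier
xr = groupwise_rotate(x)            # outlier energy spreads over its group
q, s = quantize_int4(xr)            # 4-bit codes now see a tamer range
# H is symmetric and orthonormal, so rotating again inverts the rotation:
x_rec = groupwise_rotate(q * s)     # approximate dequantized reconstruction
```

Because the rotation is orthonormal, a linear layer can absorb it into its weights (`W x == (W Hᵀ)(H x)`), which is what makes this kind of scheme plug-and-play: only the stored weights and the activation path change, not the model's function.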
Feice Huang
SIGS, Tsinghua University
Zuliang Han
Central Media Technology Institute, Huawei
Xing Zhou
Computer Science, University of Illinois at Urbana-Champaign
Yihuang Chen
Central Media Technology Institute, Huawei
Lifei Zhu
Central Media Technology Institute, Huawei
Haoqian Wang
SIGS, Tsinghua University