🤖 AI Summary
Diffusion Transformers achieve high-fidelity image generation but suffer from excessive memory consumption and high inference latency due to their massive parameter counts, hindering practical deployment. To address this, we propose the first rotation-based 4-bit quantization method (W4A4) designed specifically for diffusion Transformers, requiring no retraining and enabling plug-and-play integration. Our approach applies group-wise Hadamard transforms to suppress outliers along both the channel and sequence dimensions, and introduces the ConvLinear4bit module to jointly realize rotation, grouped quantization, low-bit GEMM, and dequantization. Evaluated on FLUX.1-dev, our method achieves a 2.26× inference speedup and 4.05× memory compression with no loss of generation quality. The key contribution is the first integration of rotation-based quantization into diffusion Transformer architectures, which circumvents conventional quantization's reliance on restrictive weight-distribution assumptions and establishes a new paradigm for efficient large-model deployment.
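The quantize → low-bit GEMM → dequantize stages described above can be sketched numerically as follows. This is a minimal illustration only: `w4a4_linear`, the symmetric per-tensor scaling, and the error threshold are our assumptions, not the actual ConvLinear4bit kernel (which fuses rotation as well and runs true 4-bit GEMM on hardware).

```python
import numpy as np

def quant_int4(x):
    """Symmetric per-tensor 4-bit quantization: integer codes in [-8, 7] plus a scale.
    (Illustrative; the real module uses grouped quantization.)"""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int32)
    return q, scale

def w4a4_linear(x, w):
    """Hypothetical sketch of a W4A4 linear layer: quantize activations and
    weights to 4 bits, multiply in integer arithmetic (accumulate in int32),
    then dequantize the output with the product of the two scales."""
    qx, sx = quant_int4(x)
    qw, sw = quant_int4(w)
    acc = qx @ qw.T                      # low-bit GEMM, int32 accumulation
    return acc.astype(np.float64) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))    # activations
w = rng.normal(size=(32, 64))   # weights
y = w4a4_linear(x, w)
ref = x @ w.T
rel = np.linalg.norm(y - ref) / np.linalg.norm(ref)
print(y.shape, rel)  # modest relative error on well-behaved inputs
```

On outlier-free Gaussian inputs the relative error stays small; the rotation step exists precisely because real transformer activations contain outliers that would otherwise inflate the quantization scale.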
📝 Abstract
Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining and preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26× speedup and 4.05× memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
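A minimal numerical sketch of the group-wise rotation idea (illustrative only; the function names, group size, and per-tensor fake quantization are our assumptions, not the paper's implementation): an orthonormal Hadamard rotation applied within each channel group spreads a single outlier channel's energy across the group, shrinking the quantization scale and hence the 4-bit rounding error. Because each group has fixed size, the rotation cost grows linearly with the channel count rather than quadratically as for a full-size rotation.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def groupwise_rotate(x, group=16):
    """Rotate each contiguous group of `group` channels with an orthonormal
    Hadamard matrix (hypothetical sketch of the group-wise transform;
    O(n * group) work for n channels, i.e. linear for fixed group size)."""
    H = hadamard(group) / np.sqrt(group)  # orthonormal, hence exactly invertible
    n = x.shape[-1]
    assert n % group == 0
    xg = x.reshape(*x.shape[:-1], n // group, group)
    return (xg @ H).reshape(x.shape)

def fake_quant_int4(x):
    """Symmetric 4-bit fake quantization with a single per-tensor scale."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 64))
x[:, 3] *= 50.0  # inject one column-wise (channel) outlier

err_plain = np.linalg.norm(fake_quant_int4(x) - x)
xr = groupwise_rotate(x)
# The rotation is orthonormal, so quantization error measured in the rotated
# basis equals the error after rotating back.
err_rot = np.linalg.norm(fake_quant_int4(xr) - xr)
print(err_rot < err_plain)
```

The outlier channel dominates the per-tensor scale before rotation; after the group-wise Hadamard transform its magnitude is divided roughly by the square root of the group size, so the same 4-bit budget yields a visibly smaller reconstruction error.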