🤖 AI Summary
Diffusion Transformers (DiTs) suffer from high inference latency because generation is iterative and dominated by compute-intensive GEMM operations. Existing quantization methods struggle to preserve generation quality while delivering meaningful speedup at low bit-widths. To address this, the paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution built on mixed-precision Microscaling (MX) formats. MixDiT identifies magnitude-based outliers in DiT activation tensors and quantizes them at higher precision, which yields mixed-precision GEMM operations. A dedicated MixDiT accelerator supports precision-flexible multiplications and efficient MX precision conversions so that the mixed-precision arithmetic translates into real speedup. In the reported experiments, MixDiT achieves a 2.10×–5.32× speedup over an RTX 3090 GPU with no loss in FID.
📝 Abstract
Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inference is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address this challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inference with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produces mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a 2.10×–5.32× speedup over an RTX 3090 GPU, with no loss in FID.
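To make the idea of mixed-precision MX quantization concrete, here is a minimal NumPy sketch. It is not the MixDiT algorithm: the abstract does not state the outlier selection rule, the element formats, or the granularity, so everything below is an assumption made for illustration. The sketch assumes blocks of 32 elements sharing one power-of-two scale (as in MX-style formats), signed-integer elements, and a hypothetical `outlier_fraction` parameter that sends the blocks with the largest magnitudes to the higher bit-width.

```python
import numpy as np

BLOCK = 32  # assumed block size: elements that share one power-of-two scale


def mx_quantize_block(block, bits):
    """Fake-quantize one block: shared power-of-two scale, signed-integer elements."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(block))
    if amax == 0:
        return np.zeros_like(block)
    # Smallest power-of-two scale that represents amax without clipping.
    scale = 2.0 ** np.ceil(np.log2(amax / qmax))
    q = np.clip(np.round(block / scale), -qmax, qmax)
    return q * scale  # dequantized values, for error inspection


def mixed_mx_quantize(x, low_bits=4, high_bits=8, outlier_fraction=0.1):
    """Quantize x block-wise; blocks with the largest magnitudes get high_bits.

    The selection rule (top fraction of blocks by max |value|) is a
    hypothetical stand-in for a magnitude-based outlier criterion.
    x.size must be a multiple of BLOCK.
    """
    flat = x.reshape(-1, BLOCK)
    block_max = np.abs(flat).max(axis=1)
    n_high = int(np.ceil(outlier_fraction * len(flat)))
    # Indices of the n_high blocks with the largest peak magnitude.
    high_idx = np.argsort(block_max)[len(flat) - n_high:]
    is_high = np.zeros(len(flat), dtype=bool)
    is_high[high_idx] = True

    out = np.empty_like(flat)
    for i, blk in enumerate(flat):
        out[i] = mx_quantize_block(blk, high_bits if is_high[i] else low_bits)
    return out.reshape(x.shape), is_high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(8, 64))
    acts[0, 5] = 50.0  # inject one large-magnitude outlier
    acts_q, is_high = mixed_mx_quantize(acts)
    print("high-precision blocks:", np.flatnonzero(is_high))
    print("mean abs error:", np.mean(np.abs(acts_q - acts)))
```

In this sketch, a block containing an outlier would otherwise be forced onto a coarse scale at 4 bits, so its small values collapse toward zero; giving that block 8 bits keeps the scale fine enough to preserve them. A GEMM consuming such a tensor then sees operands of different precisions, which is the situation that precision-flexible hardware is meant to handle.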