MixDiT: Accelerating Image Diffusion Transformer Inference with Mixed-Precision MX Quantization

📅 2025-04-11

🤖 AI Summary
Diffusion Transformers (DiTs) suffer from high inference latency because inference is iterative and dominated by compute-intensive GEMM operations. Existing quantization methods struggle to deliver both high accuracy and substantial speedup at low precision. MixDiT is an algorithm-hardware co-designed solution that quantizes DiT activations with mixed Microscaling (MX) formats: magnitude-based outliers in the activation tensors are kept at higher precision while the remaining values use lower precision, which yields mixed-precision GEMM operations. To turn the mixed-precision arithmetic into real speedup, the authors design a dedicated MixDiT accelerator that supports precision-flexible multiplications and efficient conversion between MX precisions. The reported results show a 2.10× to 5.32× speedup over an RTX 3090 GPU with no loss in FID.

📝 Abstract
Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inferencing is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address the challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inferencing with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produce mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a speedup of 2.10-5.32 times over RTX 3090, with no loss in FID.
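
To make the idea of magnitude-based mixed precision concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm: the abstract does not give the outlier criterion, the precisions, or the granularity. The sketch assumes blocks of 32 values that share a power-of-two scale (as in MX-style formats), signed-integer elements instead of MX floating-point elements, a simple scale rule, and a hypothetical policy that keeps the blocks with the largest peak magnitude at 8 bits and the rest at 4 bits. All function names and parameters are illustrative.

```python
import math

BLOCK = 32  # values per shared scale (assumed block size)

def quantize_block(block, bits):
    """Quantize one block to signed integers that share a power-of-two scale."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent such that amax / 2**exp still fits in [-qmax, qmax].
    # (Simplified rule chosen for this sketch.)
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    q = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return exp, q

def dequantize_block(exp, q):
    """Map the integers back to real values using the shared scale."""
    return [v * 2.0 ** exp for v in q]

def mixed_quantize(values, outlier_fraction=0.25, low_bits=4, high_bits=8):
    """Hypothetical policy: blocks with the largest peaks get high_bits."""
    blocks = [values[i:i + BLOCK] for i in range(0, len(values), BLOCK)]
    peaks = [max(abs(v) for v in b) for b in blocks]
    n_high = max(1, round(outlier_fraction * len(blocks)))
    order = sorted(range(len(blocks)), key=lambda i: peaks[i], reverse=True)
    high_ids = set(order[:n_high])
    packed = []
    for i, b in enumerate(blocks):
        bits = high_bits if i in high_ids else low_bits
        exp, q = quantize_block(b, bits)
        packed.append((bits, exp, q))
    return packed

if __name__ == "__main__":
    activations = [0.01 * ((i % 7) - 3) for i in range(128)]
    activations[40] = 50.0  # inject a single large-magnitude outlier
    for bits, exp, q in mixed_quantize(activations):
        print(bits, exp, max(abs(v) for v in q))
```

In this toy input, the block that contains the outlier is the one promoted to 8 bits, where 50.0 is stored exactly as 100 × 2^-1. Under the same scale rule at 4 bits it would round to 48.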

Problem

Research questions and friction points this paper is trying to address.

Accelerate DiT inference with mixed-precision quantization
Reduce compute intensity of GEMM operations in DiT
Maintain high accuracy while achieving speedup

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-precision MX quantization for DiT
Algorithm-hardware co-designed acceleration solution
Precision-flexible multiplications in MixDiT accelerator
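
The following sketch illustrates why block-scaled formats lend themselves to mixed-precision GEMM. It is an illustration under assumptions, not the MixDiT accelerator's datapath: each block is represented as a hypothetical `(shared_exponent, integer_elements)` pair, the second operand is assumed to use the same block layout, and the element width may differ from block to block. Within a block pair the work is an integer multiply-accumulate, followed by a single power-of-two rescale.

```python
def mx_block_dot(a_block, b_block):
    """Dot product of two blocks given as (shared_exponent, int_elements)."""
    exp_a, qa = a_block
    exp_b, qb = b_block
    acc = sum(x * y for x, y in zip(qa, qb))   # integer multiply-accumulate
    return acc * 2.0 ** (exp_a + exp_b)        # one rescale per block pair

def mixed_precision_dot(a_blocks, b_blocks):
    """Sum block dot products; blocks may hold 4-bit or 8-bit integers."""
    return sum(mx_block_dot(a, b) for a, b in zip(a_blocks, b_blocks))

if __name__ == "__main__":
    # First block holds 8-bit-range values, second block 4-bit-range values.
    a = [(-1, [100, -3]), (-7, [4, -4])]
    b = [(0, [1, 2]), (0, [3, 1])]
    print(mixed_precision_dot(a, b))
```

In software the same routine handles both element widths because Python integers are width-agnostic. In hardware the multiplier width is fixed, which is why a mixed-precision scheme needs multipliers that can operate at more than one precision.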