MixDiT: Accelerating Image Diffusion Transformer Inference with Mixed-Precision MX Quantization

📅 2025-04-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion Transformers (DiTs) suffer from high inference latency due to their iterative denoising process and compute-intensive GEMM operations. Existing quantization methods struggle to preserve generation quality while achieving meaningful acceleration at low bit-widths. To address this, we propose MixDiT, an algorithm-hardware co-designed mixed-precision quantization framework built on Microscaling (MX) formats. MixDiT dynamically identifies large-magnitude outliers in DiT activation tensors and preserves them at higher precision while quantizing the remaining values aggressively. Coupled with a precision-flexible GEMM datapath and efficient MX format conversion, our method breaks the traditional accuracy-latency trade-off, incurring no FID degradation versus full-precision inference. Our custom hardware accelerator achieves a 2.10x-5.32x end-to-end inference speedup over an RTX 3090 GPU. This work delivers a scalable, high-fidelity acceleration solution for practical DiT deployment.

📝 Abstract
Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inference is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address this challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inference with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produces mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a 2.10x-5.32x speedup over an RTX 3090, with no loss in FID.
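The MX formats the abstract refers to share one power-of-two scale across a small block of elements, with each element stored as a low-bit value under that scale. The following NumPy sketch illustrates this idea only; the function name, block size, and bit-width are illustrative choices, and real MX formats (e.g. MXFP4) differ in element encoding.

```python
import numpy as np

def mx_quantize(block, mantissa_bits=4):
    """Quantize a block to an MX-style format: one shared power-of-two
    scale for the whole block plus low-bit signed integer values.
    Illustrative sketch only, not the exact MX element encoding."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block, dtype=np.float64)
    # Largest representable integer magnitude, e.g. 7 for 4 bits.
    qmax = 2 ** (mantissa_bits - 1) - 1
    # Power-of-two scale chosen so scale * qmax covers max_abs.
    scale = 2.0 ** np.ceil(np.log2(max_abs / qmax))
    # Round each element onto the shared low-bit grid.
    q = np.clip(np.round(block / scale), -qmax, qmax)
    return q * scale  # dequantized values

x = np.random.randn(32).astype(np.float32)
x_q = mx_quantize(x)
err = np.max(np.abs(x - x_q))  # bounded by half the shared scale
```

Because the scale is shared per block, one large-magnitude outlier inflates the scale and crushes all its neighbors onto a coarse grid, which is exactly the failure mode MixDiT's outlier handling targets.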
Problem

Research questions and friction points this paper is trying to address.

Accelerate DiT inference with mixed-precision quantization
Reduce compute intensity of GEMM operations in DiT
Maintain high accuracy while achieving speedup
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-precision MX quantization for DiT
Algorithm-hardware co-designed acceleration solution
Precision-flexible multiplications in MixDiT accelerator
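The core algorithmic idea above, splitting activations by magnitude so that outliers take a high-precision path and the rest a low-precision one, can be sketched in NumPy as follows. This is a minimal illustration under assumed policies: `outlier_frac`, the per-tensor quantile threshold, and the uniform inlier grid are all hypothetical choices, not the paper's exact mechanism or formats.

```python
import numpy as np

def outlier_split_matmul(A, W, outlier_frac=0.02, mantissa_bits=4):
    """Sketch of magnitude-based mixed-precision GEMM: the largest-
    magnitude activation elements stay in full precision, the rest are
    quantized to a low-bit uniform grid before the matmul."""
    # Threshold marking roughly the top `outlier_frac` of |A| as outliers.
    thresh = np.quantile(np.abs(A), 1.0 - outlier_frac)
    mask = np.abs(A) >= thresh
    # Low-precision path: quantize the inliers on a coarse uniform grid.
    inliers = np.where(mask, 0.0, A)
    qmax = 2 ** (mantissa_bits - 1) - 1
    scale = max(np.max(np.abs(inliers)) / qmax, 1e-12)
    inliers_q = np.clip(np.round(inliers / scale), -qmax, qmax) * scale
    # High-precision path: outliers pass through unquantized.
    outliers = np.where(mask, A, 0.0)
    # The two partial GEMMs sum to the mixed-precision result.
    return inliers_q @ W + outliers @ W
```

Keeping the outliers out of the low-precision path means the inlier scale is set by typical values rather than by the extremes, so the quantization error of the bulk of the tensor shrinks; a hardware datapath with precision-flexible multipliers can then execute both partial GEMMs efficiently.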
Daeun Kim
School of Computing, Graduate School of AI Semiconductor, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea
Jinwoo Hwang
Changhun Oh
School of Computing, Graduate School of AI Semiconductor, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea
Jongse Park
Associate Professor, School of Computing, KAIST
Computer Architecture · HW/SW Codesign · AI Systems · Autonomous Systems